Chapter 2 — Data: The Fuel That Powers Everything¶
The midnight call that changed everything¶
You did everything right. The ML team asked for a GPU cluster, and you delivered — eight NVIDIA A100 GPUs across two nodes, connected with high-bandwidth networking, running the latest CUDA drivers. The deployment was textbook. The team kicked off their first training job at 6 PM on a Friday, and you headed home feeling good about a week well spent.
Your phone buzzes at midnight. The lead data scientist sounds frustrated: "The GPUs aren't working. Training that should take four hours hasn't even finished the first epoch." You remote in and pull up the metrics. GPU utilization: 12%. GPU memory usage: barely a third. But then you see it — disk I/O is pegged at 100%, with read throughput crawling at 60 MB/s. The team stored 2 TB of training images on a Standard HDD-backed Blob Storage account mounted via a basic SMB share. Your storage architecture is starving the most expensive hardware in the rack.
This story plays out in organizations every week. Teams pour budgets into GPUs, only to discover that their data pipeline — the part infrastructure engineers own — is the real bottleneck. Data is to AI what fuel is to an engine, but the fuel line matters as much as the fuel itself. This chapter will teach you how to build fuel lines that keep those GPUs fed.
Why everything starts with data¶
Every AI system, from a simple image classifier to a trillion-parameter large language model, depends on three fundamental components working together. You can think of it as a formula:
Data + Model + Compute = AI
Remove any one of these and you have nothing. But here is the insight that most infrastructure engineers miss early on: of those three components, data is the one that touches infrastructure at every single stage. The model is code. Compute is provisioned and (mostly) left running. But data must be ingested, stored, prepared, served for training, and delivered at inference — and every one of those stages is an infrastructure problem.
Infra ↔ AI Translation
| Infrastructure Concept | AI Equivalent | Why It Matters |
|---|---|---|
| Storage account / volume | Training dataset repository | Where raw data lives before the model sees it |
| Read throughput (MB/s) | Data loader speed | Determines how fast GPUs receive training batches |
| IOPS | Samples per second | Small-file workloads (images) need high IOPS |
| Storage tiers (Hot/Cool/Archive) | Data lifecycle stages | Hot for active training, Cool for completed datasets, Archive for compliance |
| Blob container | Dataset partition | Logical grouping of training, validation, and test data |
| NFS/SMB mount | POSIX file access for frameworks | PyTorch and TensorFlow expect filesystem semantics |
| Encryption at rest | Data protection compliance | Required for PII, medical, and financial datasets |
If you already manage storage, networking, and access control, you understand 70% of the AI data stack. What changes is the intensity — AI workloads push read throughput, IOPS, and sequential I/O harder than almost anything you have provisioned before.
Types of data in AI workloads¶
Not all data is created equal, and the type of data directly determines which storage backend and pipeline architecture you need.
| Data Type | Examples | Common AI Use Cases | Storage Implications |
|---|---|---|---|
| Structured | SQL tables, CSVs, spreadsheets | Predictive models, classification, feature stores | SQL Database, Cosmos DB, Parquet on Data Lake |
| Semi-structured | JSON, XML, log files, YAML | Chatbots, behavior analysis, NLP preprocessing | Blob Storage, Data Lake Gen2 |
| Unstructured | Images, video, audio, free text, PDFs | Computer vision, LLMs, speech recognition | Blob Storage, NVMe scratch |
| Temporal | Time series, telemetry, IoT sensor data | Demand forecasting, anomaly detection | Azure Data Explorer, Cosmos DB, Data Lake |
Modern AI is dominated by unstructured data. A single large language model might train on terabytes of text scraped from the web. A computer vision model needs millions of images. A speech-to-text system ingests thousands of hours of audio. All of this is unstructured, all of it is massive, and all of it needs to be read sequentially at high throughput during training.
💡 Pro Tip: When a data science team tells you "we have about 500 GB of training data," assume it will grow to 5 TB within six months. Model experimentation multiplies dataset size through augmentation, versioning, and preprocessing variants. Size your storage account and throughput limits accordingly.
The data lifecycle — Infrastructure at every stage¶
Data does not sit still. It flows through a pipeline with distinct stages, and you own the infrastructure at every one of them.
Ingestion¶
This is where data enters your environment. It might arrive through REST APIs, event streams, batch file uploads, or database replication. Your job is to ensure ingestion is reliable, scalable, and secure.
Key tools: - Azure Event Hubs — Real-time streaming ingestion at millions of events per second - Azure Data Factory — Orchestrated batch pipelines from 90+ data sources - AzCopy — High-performance command-line transfers for bulk data movement
⚠️ Production Gotcha: Ingestion pipelines that work fine with 10 GB fail spectacularly at 10 TB. Always test your ingestion pipeline at 10× your expected data volume. AzCopy with --cap-mbps lets you throttle transfers during business hours and run full-speed overnight.
Storage¶
Once ingested, data needs a home. Your choice of storage backend affects everything downstream — training speed, cost, security, and operational complexity. We will cover storage architecture in depth in the next section.
Key considerations: - Hot tier for datasets under active training - Cool tier for completed experiment data you might revisit - Archive tier for compliance and audit retention - Data Lake Gen2 for analytics-ready hierarchical storage
Preparation and transformation¶
Raw data is rarely model-ready. It needs cleaning, normalization, deduplication, and feature engineering. This stage is compute-intensive and often uses distributed processing frameworks.
Key tools: - Azure Databricks — Spark-based processing for large-scale data preparation - Azure Synapse Analytics — Integrated analytics with serverless SQL and Spark pools - Azure Data Factory — ETL/ELT pipeline orchestration
Infra ↔ AI Translation: Data preparation in AI is the equivalent of ETL in traditional data warehousing — but the volumes are larger and the output format is different. Instead of loading into a SQL warehouse, you are producing Parquet files, TFRecord files, or preprocessed image directories.
Training¶
During training, the model reads the entire dataset multiple times (each pass is called an epoch). A typical training run reads the same data 50-100 times, shuffled differently each time. This means your storage must deliver sustained, high-throughput sequential reads for hours or even days.
What makes training I/O unique: - Sequential reads at multi-GB/s speeds - Random shuffling requires either in-memory shuffling or fast random access - Data loaders in PyTorch and TensorFlow prefetch batches on CPU while the GPU processes the current batch — but only if storage can keep up - Checkpoint writes — models save periodic snapshots (often 1-10 GB each) to recover from failures
Inference¶
At inference time, data flows in the opposite direction — individual requests arrive, the model processes them, and results return. The pattern shifts from throughput to latency: each request must be served in milliseconds, not hours.
Infrastructure patterns: - Low-latency feature stores (Cosmos DB, Redis) for real-time feature lookup - Blob Storage for batch inference input/output - API gateways (Azure API Management) for request routing and throttling
Storage architecture for AI — The decision matrix¶
This is where your infrastructure expertise becomes critical. Choosing the right storage backend is the single most impactful decision you will make for AI workload performance.
Decision Matrix: Storage for AI Workloads
| Storage Type | Best For | Throughput | Latency | Cost | Key Feature |
|---|---|---|---|---|---|
| Blob Storage | Unstructured datasets, model artifacts, checkpoints | Up to 60 Gbps per account | Moderate (ms) | Low (Hot: ~$0.018/GB/mo) | Massive scale, tiered storage |
| Data Lake Gen2 | Analytics-heavy pipelines, structured datasets | Up to 60 Gbps per account | Moderate (ms) | Low | Hierarchical namespace, fine-grained ACLs |
| Local NVMe | Training scratch space, data loader cache | 3-7 GB/s per disk | Ultra-low (μs) | Included with VM | Ephemeral — data lost on deallocation |
| Azure Files (NFS) | Shared datasets across multiple nodes | Up to 10 Gbps (premium) | Low-moderate (ms) | Moderate | POSIX-compliant, multi-node mount |
| Azure Files (SMB) | Legacy compatibility, Windows workloads | Up to 4 Gbps (premium) | Moderate (ms) | Moderate | Windows-native, AD integration |
| Cosmos DB | Feature stores, real-time inference features | N/A (request-based) | Single-digit ms | Higher | Vector search, global distribution |
| SQL Database | Structured feature stores, metadata | N/A (query-based) | Low (ms) | Moderate | ACID compliance, relational queries |
💡 Pro Tip: The most common production pattern is a two-tier approach: store raw datasets in Blob Storage or Data Lake Gen2 for durability and cost, then stage active training data to local NVMe for performance. Think of Blob as your warehouse and NVMe as your workbench.
⚠️ Production Gotcha: Never use Standard HDD-backed storage for training workloads. The IOPS and throughput limits are orders of magnitude below what GPUs require. A single A100 GPU can consume data faster than a Standard HDD account can serve it. Always use Premium or at minimum Standard SSD-backed accounts for active training data.
I/O performance: The hidden bottleneck¶
Here is the counter-intuitive truth about AI infrastructure: the most common reason for low GPU utilization is not a GPU problem — it is a storage problem. When the data loader cannot feed batches to the GPU fast enough, the GPU sits idle waiting for data. This is called data starvation, and it turns your $30,000-per-month GPU cluster into an expensive space heater.
Diagnosing data starvation¶
Watch these metrics: - GPU utilization below 80% during training — almost always a data pipeline issue - Disk I/O at 100% while GPU utilization is low — classic storage bottleneck - Data loader workers maxed out — your CPU preprocessing or I/O is the limit
BlobFuse2: Mounting Blob Storage as a filesystem¶
Most ML frameworks (PyTorch, TensorFlow) expect training data to be accessible through a filesystem path. BlobFuse2 is an open-source virtual file system driver that mounts Azure Blob Storage containers as a local directory on Linux. It translates POSIX file operations into Azure Blob REST API calls.
BlobFuse2 offers two caching modes, and choosing the right one matters:
- File cache (caching mode): Downloads entire files to a local cache directory before serving reads. Best for datasets that are read repeatedly, such as training data across multiple epochs.
- Block cache (streaming mode): Streams data in chunks without downloading the full file. Best for large files where you need to start reading immediately, such as preprocessing or inference on large media files.
Mount with file cache for training workloads:
# Create cache directory on fast local storage (NVMe temp disk)
sudo mkdir -p /mnt/resource/blobfuse2cache
sudo chown $(whoami) /mnt/resource/blobfuse2cache
# Create mount point
sudo mkdir -p /mnt/training-data
# Mount with file cache — uses config.yaml for auth and container settings
sudo blobfuse2 mount /mnt/training-data \
--config-file=./config.yaml \
--tmp-path=/mnt/resource/blobfuse2cache
Preload data before training starts:
BlobFuse2 can download entire containers or subdirectories to the local cache at mount time, so data is ready before training begins:
# Mount with preload — downloads data into cache at mount time
sudo blobfuse2 mount /mnt/training-data \
--config-file=./config.yaml \
--tmp-path=/mnt/resource/blobfuse2cache \
--preload
💡 Pro Tip: Always point --tmp-path to the VM's local NVMe temp disk (/mnt/resource on Azure VMs) rather than the OS disk. This gives BlobFuse2's cache the lowest possible latency. On GPU VMs like the ND-series, the local temp disk can deliver 3-7 GB/s of read throughput.
For full configuration options, including managed identity authentication and streaming mode, see the BlobFuse2 documentation.
AzCopy for bulk data transfer¶
When you need to move large datasets into Azure (or between storage accounts), AzCopy is the fastest option. It supports parallel transfers, automatic retries, and can resume interrupted uploads.
# Login with Microsoft Entra ID
azcopy login
# Copy a local dataset directory to Blob Storage
azcopy copy '/data/training-images' \
'https://<storage-account>.blob.core.windows.net/<container>/training-images' \
--recursive
# Copy between storage accounts (server-side, no local download)
azcopy copy \
'https://<source-account>.blob.core.windows.net/<container>' \
'https://<dest-account>.blob.core.windows.net/<container>' \
--recursive
NVMe scratch space on GPU VMs¶
Azure GPU VMs (ND-series, NC-series) include local NVMe temporary disks that provide extremely high I/O performance. These are ideal for staging training data from Blob Storage before a training run.
Key facts about local NVMe: - Throughput: 3-7 GB/s per disk (some VMs have multiple NVMe disks) - Latency: microseconds, not milliseconds - Ephemeral: All data is lost when the VM is stopped or deallocated - No additional cost — included with the VM
The recommended pattern: use AzCopy or BlobFuse2 preload to stage data from Blob Storage to local NVMe at job start, train from NVMe, then write checkpoints back to Blob Storage for durability.
⚠️ Production Gotcha: "The 12% GPU Utilization Mystery" — If a data scientist reports that their GPU utilization is suspiciously low, check the storage backend before investigating anything else. Nine times out of ten, the issue is one of: (1) training data on Standard HDD, (2) a remote mount without caching, or (3) the BlobFuse2 cache directory pointing to the OS disk instead of the NVMe temp disk. A five-minute storage fix can turn a multi-day training job into an overnight one.
Data security and governance¶
AI workloads handle some of the most sensitive data in your organization — customer records, medical images, financial transactions, proprietary text corpora. Data governance for AI is not optional; it is an infrastructure responsibility that you own.
Data classification¶
Before any data enters a training pipeline, classify it: - Public — Open datasets, published benchmarks, public domain text - Internal — Proprietary business data, internal documents - Confidential — Customer PII, medical records (HIPAA), financial data (PCI-DSS) - Restricted — National security, export-controlled research data
Your storage architecture must enforce these classifications through network isolation, encryption, and access controls.
Encryption¶
- At rest: Azure Storage encrypts all data at rest with 256-bit AES by default. For sensitive workloads, use customer-managed keys stored in Azure Key Vault.
- In transit: All Azure Storage API calls use TLS 1.2+. Enforce minimum TLS versions at the storage account level.
Access control¶
💡 Pro Tip: Always use managed identities + Azure RBAC instead of storage account keys. Keys are static, shareable, and hard to rotate. Managed identities are tied to specific resources, automatically rotated, and auditable. Assign the Storage Blob Data Reader role for training workloads and Storage Blob Data Contributor for pipelines that write checkpoints.
# Assign Storage Blob Data Reader to a VM's managed identity
az role assignment create \
--role "Storage Blob Data Reader" \
--assignee <managed-identity-principal-id> \
--scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
Governance tools¶
- Microsoft Purview — Unified data governance with automated data discovery, classification, and lineage tracking across your entire estate
- Azure Key Vault — Centralized secrets management for storage keys, connection strings, and API keys that pipelines need
- Azure Policy — Enforce storage standards (minimum TLS version, required encryption, allowed SKUs) across all subscriptions
⚠️ Production Gotcha: Data scientists frequently copy training data to local machines, shared drives, or unmanaged storage accounts for "quick experiments." This creates shadow data sprawl that violates compliance requirements. Use Azure Policy to restrict storage account creation and Microsoft Purview to scan for data copies outside approved locations.
Common data architectures in AI¶
Understanding how these components fit together is easier with visual references. Here are two common architecture patterns.
Simple training pipeline¶
In the simplest form, data flows linearly from storage through preprocessing into training:
graph LR
A["Azure Blob Storage<br/>(Raw Data)"] -->|Clean + Transform| B["Preprocessing"]
B -->|Train| C["GPU VM<br/>(Model Training)"]
C -->|Save| D["Trained Model<br/>(Checkpoint)"] This pattern works for small teams running experiments with datasets under 1 TB. Data lives in Blob Storage, a preprocessing script cleans and transforms it, and the training framework reads directly from the processed output.
Full production pipeline¶
Production AI systems add ingestion orchestration, data versioning, model registries, and inference endpoints:
graph LR
A["Event Hubs /<br/>Data Factory<br/>(Ingestion)"] --> B["Data Lake Gen2<br/>(Versioned)"]
B --> C["Databricks /<br/>Synapse<br/>(Processing)"]
C --> D["Azure ML<br/>(Training)"]
D --> E["Model Registry"]
E --> F["Inference<br/>Endpoint"]
D --> G["Blob Storage<br/>(Checkpoints)"] In this architecture, data is ingested through Event Hubs or Data Factory, stored in Data Lake Gen2 with hierarchical namespaces for organization, processed through Databricks or Synapse pipelines, and versioned for reproducibility. Training reads from the versioned dataset, writes checkpoints to Blob Storage, and registers completed models for deployment to inference endpoints.
Infra ↔ AI Translation: If you have built CI/CD pipelines before, a production ML pipeline is the same concept — but instead of code artifacts, you are moving data artifacts through build stages. The "source code" is the dataset, the "build" is training, and the "deployment" is model serving.
Hands-on: Working with Blob Storage for AI data¶
Let's walk through creating a storage account optimized for AI workloads using the Azure CLI. Every command uses --auth-mode login for Microsoft Entra ID authentication — no storage keys needed.
Step 1: Create a resource group¶
# Define variables — replace with your values
RESOURCE_GROUP="rg-ai-training"
LOCATION="eastus2"
STORAGE_ACCOUNT="staitraining$(openssl rand -hex 4)"
# Create the resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
Step 2: Create a storage account with Data Lake Gen2¶
Enabling hierarchical namespace gives you Data Lake Gen2 capabilities — fine-grained ACLs, directory-level operations, and optimized analytics performance.
# Create storage account with hierarchical namespace (Data Lake Gen2)
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_LRS \
--kind StorageV2 \
--enable-hierarchical-namespace true \
--min-tls-version TLS1_2 \
--allow-blob-public-access false
# Verify creation
az storage account show \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--query "{name:name, kind:kind, hns:isHnsEnabled, location:primaryLocation}" \
--output table
Step 3: Assign RBAC permissions¶
# Get your signed-in user's object ID and assign Blob Data Contributor role
az ad signed-in-user show --query id -o tsv | az role assignment create \
--role "Storage Blob Data Contributor" \
--assignee @- \
--scope "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT"
Note: Role assignments may take 1-2 minutes to propagate. Wait briefly before proceeding.
Step 4: Create a container and upload data¶
# Create a container for training data
az storage container create \
--account-name $STORAGE_ACCOUNT \
--name training-data \
--auth-mode login
# Upload a sample file
echo '{"sample": "training data", "label": 1}' > sample.json
az storage blob upload \
--account-name $STORAGE_ACCOUNT \
--container-name training-data \
--name datasets/sample.json \
--file sample.json \
--auth-mode login
Step 5: List and download data¶
# List blobs in the container
az storage blob list \
--account-name $STORAGE_ACCOUNT \
--container-name training-data \
--auth-mode login \
--output table
# Download a blob
az storage blob download \
--account-name $STORAGE_ACCOUNT \
--container-name training-data \
--name datasets/sample.json \
--file downloaded-sample.json \
--auth-mode login
# Verify the download
cat downloaded-sample.json
Step 6: Bulk upload with AzCopy (for larger datasets)¶
# Login to AzCopy with Entra ID
azcopy login
# Upload an entire directory recursively
azcopy copy './local-dataset/' \
"https://${STORAGE_ACCOUNT}.blob.core.windows.net/training-data/v1/" \
--recursive
For complete Blob Storage CLI reference, see the Azure Storage CLI quickstart.
Chapter checklist¶
I/O is the hidden bottleneck — Low GPU utilization during training is almost always a storage problem, not a GPU problem.
Match storage to workload — Use Blob/Data Lake Gen2 for durable storage, local NVMe for training scratch space, and Cosmos DB for low-latency inference features.
Use BlobFuse2 wisely — Mount Blob Storage as a filesystem with file cache mode for training. Always point the cache to the VM's local NVMe temp disk.
Stage data for training — Copy datasets from Blob Storage to local NVMe before training starts. Write checkpoints back to Blob for durability.
Never use Standard HDD for training — The throughput gap between Standard HDD and NVMe is 100× or more. Premium storage or local NVMe is required for GPU workloads.
Secure by default — Use managed identities and RBAC instead of storage keys. Classify data before it enters any pipeline.
Plan for growth — Training datasets multiply through augmentation, versioning, and experimentation. Size your storage for 10× current needs.
Use AzCopy for bulk transfers — It is the fastest way to move data into, out of, and between Azure Storage accounts.
What's next¶
Now that you understand how data flows through AI systems and why your storage architecture decisions directly determine model training performance, it is time to look at the compute layer that consumes all that data. In Chapter 3 — Compute: Where Intelligence Comes to Life, you will learn about GPU architectures, VM families, cluster design, and the networking that ties them together — and why a well-tuned storage layer is only half the equation.