Cloud GPU Options Compared: AWS vs GCP vs Azure for ML
You've got a massive machine learning project ahead. Maybe you're training a 7B parameter language model. Maybe you need to fine-tune a vision transformer on your proprietary dataset. Either way, you're staring at the cloud ecosystem and wondering: Where should I actually run this thing?
AWS, Google Cloud, and Microsoft Azure all offer serious GPU horsepower. The prices have shifted dramatically in the last year. Spot instance strategies that worked in 2024 might leave you broke in 2026. And distributed training across GPUs? That's where networking performance becomes your secret weapon.
This article cuts through the marketing noise. We'll compare actual hardware, actual prices (as of February 2026), and show you a Python calculator so you can model your own costs. By the end, you'll know exactly which cloud fits your workload.
Table of Contents
- The GPU Landscape: What You're Actually Getting
- AWS P-Series: The Workhorses
- GCP A3 Series: The Dark Horse
- Azure GPU Options: The Expensive but Reliable Choice
- The Comparison Table: Hard Numbers
- GPU Networking: The Hidden Performance Layer
- AWS: NVSwitch (P5 Only)
- GCP: Custom Ethernet + SmartNIC
- Azure: Ethernet + InfiniBand
- Spot / Preemptible Instances: Risk vs. Reward
- AWS Spot Interruption Rates
- GCP Preemptible: The 24-Hour Rule
- Azure Spot: The Cheapest Option
- Checkpoint-Restart Pattern (Python)
- Managed ML Services: SageMaker vs. Vertex AI vs. Azure ML
- AWS SageMaker
- GCP Vertex AI
- Azure ML
- Real Cost Scenario: Training a 7B LLM on 64 A100s for 2 Weeks
- Ingredient Breakdown
- AWS Scenario (p4d.24xlarge)
- GCP Scenario (a3-highgpu-8g)
- Azure Scenario (NDH100v5 Spot)
- Cost Summary
- GPU Instance Selection Decision Tree
- Python Cost Calculator: Customize for Your Workload
- Decision Framework: Which Cloud for Your Workload?
- The Operational Reality: What Pricing Tables Don't Tell You
- Multicloud Strategies: Hedging Your Bets
- Wrapping Up: The Real Cost of ML at Scale
- Negotiating with Cloud Providers: Getting Better Rates
- The Strategic Multicloud Approach
The GPU Landscape: What You're Actually Getting
Let me be direct: not all GPUs are created equal. The cloud providers each have a lineup, and they've optimized them for different scenarios. Understanding the hardware hierarchy is crucial because upgrading from an L4 to an A100 isn't just a 2x performance difference - it's more like 10-15x for large-scale training.
AWS P-Series: The Workhorses
AWS offers three main GPU instance families for ML, each optimized for different compute density:
P5 Series (NVIDIA H100 Tensor Core GPUs):
- p5.4xlarge: 4 × H100 GPUs, 144GB GPU memory, $6.88/hour on-demand
- p5.48xlarge: 8 × H100 GPUs (NVSwitch interconnect), 1,152GB CPU memory, $39.80/hour as of January 2026
P4d Series (NVIDIA A100 Tensor Core GPUs):
- p4d.24xlarge: 8 × A100 (80GB), 768GB CPU memory, $32.77/hour
- NVSwitch provides 600 GB/s GPU-to-GPU bandwidth
G5 Series (NVIDIA A10G GPUs):
- g5.12xlarge: 4 × A10G GPUs, $7.32/hour
- Best for inference and fine-tuning, not large-scale training
Key takeaway: AWS P5 instances just dropped 45% in June 2025, making them competitive again after years of premium pricing. This is significant - it reshuffles the entire economics of distributed training on AWS.
The pricing history matters because cloud GPU costs shift. A year ago, GCP was significantly cheaper. Now AWS is competitive again. Plan your infrastructure expecting these shifts.
GCP A3 Series: The Dark Horse
Google Cloud's A3 instances use the same H100 GPUs as AWS but in different packaging. Google's infrastructure engineering gives them unique advantages in networking reliability, even without NVSwitch.
A3 High-GPU (a3-highgpu-8g):
- 8 × H100 80GB GPUs, $88.49/hour on-demand
- Fast interconnect (3200 Gbps effective)
- No NVSwitch, but Google's custom networking is exceptionally reliable
A3 Mega (a3-megagpu-8g):
- 16 × H100 GPUs (two instances networked), extreme scale
- $176.98/hour base, but often used with Committed Use Discounts (CUD)
Bonus: GCP's preemptible instances cost $2.25/GPU-hour for A3-High. That's roughly 80% off the on-demand per-GPU rate. The catch? Interruption risk, but more on that later. For many workloads, this is the best-kept secret in cloud GPU pricing.
Key takeaway: GCP's committed use discounts can drive long-term costs below AWS. Preemptibles are a steal if your workload tolerates interruptions. This changes the entire ROI calculation for long-running training jobs.
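One way to see why the commitment math matters: a committed-use discount bills you for the committed capacity whether you use it or not, so it only wins above a utilization threshold equal to the discount itself. A minimal sketch, using the on-demand rate above and assuming a 50% three-year CUD (the exact discount is an assumption):

```python
# Back-of-envelope: when does a committed-use discount (CUD) beat on-demand?
# Rates are illustrative, taken from the figures above; 50% is an assumed
# 3-year CUD discount.
ON_DEMAND_PER_GPU_HR = 88.49 / 8        # a3-highgpu-8g, per GPU
CUD_DISCOUNT = 0.50                     # assumed 3-year CUD
cud_rate = ON_DEMAND_PER_GPU_HR * (1 - CUD_DISCOUNT)

def monthly_cost(rate_per_gpu_hr, gpus, utilization=1.0, hours=730):
    """Monthly cost; a CUD bills committed capacity at utilization=1.0."""
    return rate_per_gpu_hr * gpus * hours * utilization

# A CUD is billed whether or not the GPUs are busy, so its effective
# per-used-hour price rises as utilization falls.
for util in (1.0, 0.75, 0.5, 0.25):
    cud = monthly_cost(cud_rate, gpus=8)                        # billed in full
    on_demand = monthly_cost(ON_DEMAND_PER_GPU_HR, 8, util)     # billed when used
    better = "CUD" if cud < on_demand else "on-demand"
    print(f"{util:.0%} utilization: CUD ${cud:,.0f} vs on-demand ${on_demand:,.0f} -> {better}")
```

With a 50% discount the break-even sits at 50% utilization: below that, plain on-demand was cheaper, and preemptibles shift the math further still.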
Azure GPU Options: The Expensive but Reliable Choice
Azure's GPU lineup includes:
ND H100 v5 Series (NDH100v5):
- Up to 8 × H100 GPUs per instance
- Horovod/MPI-optimized networking
- Roughly $6.98/GPU-hour base in East US
- Spot instances available at ~$6.81/hour for the 8-GPU instance - roughly $0.85/GPU-hour (huge discount)
NC-series (NCv3, NC_A100_v4):
- A100 and older V100 options
- More cost-effective for smaller workloads
- Less bandwidth between GPUs
NC-A100-v4:
- 4 or 8 × A100 GPUs per instance
- Competitive on price with older GPU hardware
Key takeaway: Azure's spot instances are criminally cheap, but on-demand pricing remains the highest among the three. Use spots or committed discounts. Full-price on-demand on Azure is economically unsustainable for serious GPU workloads.
The Comparison Table: Hard Numbers
Here's what you're actually paying (as of February 2026, USD, us-east/us-central-1 regions). These prices shift quarterly, so treat this as directional:
| Provider | Instance Type | GPUs | GPU Memory | On-Demand/hr | Spot/Preemptible/hr | 30-day Cost (24/7) |
|---|---|---|---|---|---|---|
| AWS | p5.48xlarge | 8 × H100 | 640GB | $39.80 | ~$9.95 | $28,750 |
| AWS | p4d.24xlarge | 8 × A100 (80GB) | 640GB | $32.77 | ~$8.19 | $23,600 |
| GCP | a3-highgpu-8g | 8 × H100 | 640GB | $88.49 | $18.00 ($2.25/GPU-hr preemptible) | $63,873 |
| GCP | a3-highgpu-8g (preemptible only) | 8 × H100 | 640GB | N/A | $18.00 | $12,960 |
| Azure | NDH100v5 (8 GPU) | 8 × H100 | 640GB | ~$55.84 | ~$6.81 | $40,203 |
| Azure | NC A100 v4 | 4 × A100 | 320GB | ~$21.92 | ~$2.68 | $15,784 |
Notice something? GCP's on-demand price looks absurd - but their CUDs and preemptibles flip the economics. Azure spot is unbeatable for interruption-tolerant workloads. AWS offers the middle ground with good pricing and ecosystem maturity.
The real decision isn't which cloud is "cheapest" - it's which cloud is cheapest for your specific workload. A 2-week training job? GCP preemptibles win. Long-term committed capacity? GCP CUDs or Azure reserved instances. Instant need with no interruption tolerance? AWS on-demand or GCP on-demand.
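Because GPU counts differ across rows (the NC A100 v4 entry has four GPUs, the rest eight), per-GPU-hour is the cleaner way to compare. A quick normalization of the table's rates (the GCP spot figure assumes $2.25/GPU-hr preemptible):

```python
# Normalize the comparison table to $/GPU-hour (rates from the table above).
instances = {
    # name: (gpu_count, on_demand_per_hr, spot_or_preemptible_per_hr)
    "AWS p5.48xlarge":   (8, 39.80, 9.95),
    "AWS p4d.24xlarge":  (8, 32.77, 8.19),
    "GCP a3-highgpu-8g": (8, 88.49, 2.25 * 8),   # preemptible, $2.25/GPU-hr
    "Azure NDH100v5":    (8, 55.84, 6.81),
    "Azure NC A100 v4":  (4, 21.92, 2.68),
}

for name, (gpus, od, spot) in instances.items():
    print(f"{name}: on-demand ${od / gpus:.2f}/GPU-hr, "
          f"spot ${spot / gpus:.2f}/GPU-hr "
          f"({1 - spot / od:.0%} discount)")
```

Normalized this way, AWS spot lands around a 75% discount while the GCP preemptible and Azure spot rates sit closer to 80-88% off their own on-demand prices.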
GPU Networking: The Hidden Performance Layer
You've got 8 H100s in your instance. Great. But how fast can they talk to each other?
This matters enormously for distributed training. A 7B language model distributed across 8 GPUs needs to synchronize gradients constantly. If your inter-GPU communication is slow, you're leaving compute on the table. A 10% networking penalty on a $40/hour instance costs you $4/hour in wasted capacity. Multiply that by 30 days and 5 instances, and you've lost thousands in efficiency.
AWS: NVSwitch (P5 Only)
P5 instances include NVIDIA's NVSwitch interconnect, delivering:
- 600 GB/s full-duplex bandwidth between any two GPUs
- All-to-all connectivity (no bottlenecks)
- Seamless scaling to 8+ GPUs
This is why P5 dominates for massive distributed training. NVSwitch is the gold standard. When you need to coordinate gradients across dozens of GPUs with minimal latency, NVSwitch's hardware guarantees are irreplaceable.
GCP: Custom Ethernet + SmartNIC
GCP doesn't use NVSwitch, but they've built something equally smart:
- 3200 Gbps effective inter-GPU bandwidth
- Custom networking stack (low-latency)
- Horovod integration (native MPI support)
In practice, GCP trades raw bandwidth for lower latency and more predictable performance. For most LLM training, the difference is negligible. You'll see 5-10% performance differences in specific workloads, not 50%.
Azure: Ethernet + InfiniBand
Azure uses InfiniBand (200 Gbps) or Ethernet depending on the region:
- InfiniBand: 200 Gbps (better), available in limited regions
- Ethernet: 25-100 Gbps (still fast, but slower than AWS/GCP)
Azure's networking is reliable but slower. This affects distributed training efficiency, especially at scale. A 64-GPU job on Azure might take 12% longer than on AWS, not because of compute, but because of gradient synchronization overhead.
Bottom line: For 64-GPU training clusters, AWS P5 wins on raw networking. GCP is a close second. Azure requires more careful placement and job scheduling. If you're training a 7B model on 8 GPUs, these differences barely matter. At 64+ GPUs, they become significant.
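To put rough numbers on that, here's a crude ring all-reduce estimate for a 7B model with fp16 gradients. It treats each quoted figure as the bottleneck link speed and ignores compute overlap, topology (the NVSwitch figure is intra-node only), and gradient compression - directional only:

```python
# Crude ring all-reduce estimate: per sync, each GPU moves
# 2*(N-1)/N * payload bytes across the slowest link. Ignores compute
# overlap, topology, and compression - directional numbers only.
def allreduce_seconds(n_params, bytes_per_param, n_gpus, link_gbps):
    payload = n_params * bytes_per_param              # gradient bytes
    traffic = 2 * (n_gpus - 1) / n_gpus * payload     # ring all-reduce volume
    link_bytes_per_s = link_gbps * 1e9 / 8            # Gbit/s -> bytes/s
    return traffic / link_bytes_per_s

PARAMS = 7e9  # 7B model, fp16 gradients (2 bytes each)
links = {
    "NVSwitch-class (600 GB/s)": 600 * 8,   # GB/s -> Gbit/s
    "GCP-class (3200 Gbps)": 3200,
    "InfiniBand (200 Gbps)": 200,
}
for label, gbps in links.items():
    t = allreduce_seconds(PARAMS, 2, n_gpus=64, link_gbps=gbps)
    print(f"{label}: ~{t:.2f}s per full gradient sync")
```

Even this toy model shows the shape of the problem: at 200 Gbps a full gradient sync takes on the order of a second, versus tens of milliseconds on NVSwitch-class bandwidth, which is exactly where the "12% longer on Azure" kind of gap comes from.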
Spot / Preemptible Instances: Risk vs. Reward
The ultimate money hack? Spot instances. These are spare cloud capacity sold at 60-90% discount. The trade-off: they can be interrupted with minimal warning. But with proper checkpointing, interruptions become just another operational cost.
AWS Spot Interruption Rates
AWS publishes real interruption metrics. Here's what's typical:
- p5.4xlarge: 2-5% hourly interruption rate
- p4d.24xlarge: 1-3% hourly interruption rate
- g5.12xlarge: 0.5-2% hourly interruption rate
For 24/7 training, a 3% hourly rate means you'll lose the instance about 22 times per month (0.03 × 720 hours). Oof. That's 22 restarts. With proper checkpointing, each restart costs you maybe 30 minutes of recomputation (the work since the last checkpoint). So you're losing about 11 hours per month to interruptions - roughly 1.5% of your compute capacity.
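That arithmetic generalizes. A tiny model, treating each hour as an independent preemption coin flip (a simplification - real interruption rates vary by region, instance type, and capacity):

```python
# Expected spot-interruption overhead, modeling preemption as a fixed,
# independent per-hour probability (a simplification).
def interruption_overhead(hourly_prob, hours, recompute_hours=0.5):
    expected_events = hourly_prob * hours
    lost = expected_events * recompute_hours    # recompute after each event
    return expected_events, lost, lost / hours

events, lost, frac = interruption_overhead(0.03, hours=30 * 24)
print(f"~{events:.0f} interruptions/month, ~{lost:.1f} h lost ({frac:.1%} of capacity)")
```

Plug in your own observed interruption rate and checkpoint interval: if the lost-capacity fraction stays in the low single digits, the spot discount dwarfs it.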
GCP Preemptible: The 24-Hour Rule
GCP preemptible instances come with a built-in ceiling rather than a safety net: they always terminate within 24 hours of starting, and they can also be preempted earlier with 30 seconds notice.
This changes the game, because the failure mode is bounded and roughly predictable. You can:
- Request a preemptible instance
- Plan on a restart at least once every 24 hours
- Set up auto-restart with checkpointing
- Rinse and repeat
For a 2-week training job, you'd expect at least 14 preemption events (one per day). Manageable, especially with proper checkpointing. And at $2.25/GPU-hour vs the ~$11/GPU-hour on-demand rate, that roughly 80% discount more than pays for the operational complexity.
Azure Spot: The Cheapest Option
Azure's spot pricing is unbeatable - roughly $0.85/GPU-hour for H100 (~$6.81/hour for an 8-GPU NDH100v5). But the tradeoff is similar to AWS: interruptions can happen anytime, sometimes with little warning.
Azure's mitigation? Spot eviction notices give you ~30 seconds before termination - shorter than AWS's 2-minute interruption warning, but workable. Set up graceful shutdown with proper checkpoint saving, and you can handle it.
Checkpoint-Restart Pattern (Python)
Here's how to survive preemptions. This pattern saves your training state every N steps, so interruptions cost you at most N steps of recomputation:
```python
import torch
import json
from pathlib import Path

class CheckpointManager:
    def __init__(self, checkpoint_dir="./checkpoints", interval=500):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(exist_ok=True)
        self.interval = interval

    def save_checkpoint(self, model, optimizer, epoch, step):
        """Save model, optimizer state, and metadata."""
        checkpoint = {
            'epoch': epoch,
            'step': step,
            'model_state': model.state_dict(),
            'optimizer_state': optimizer.state_dict(),
        }
        ckpt_path = self.checkpoint_dir / f"checkpoint-epoch{epoch}-step{step}.pt"
        torch.save(checkpoint, ckpt_path)
        # Keep metadata for resumption
        metadata = {
            'last_epoch': epoch,
            'last_step': step,
            'last_checkpoint': str(ckpt_path)
        }
        with open(self.checkpoint_dir / "latest.json", "w") as f:
            json.dump(metadata, f)
        print(f"Checkpoint saved: {ckpt_path}")

    def load_latest(self, model, optimizer):
        """Resume from the most recent checkpoint."""
        metadata_file = self.checkpoint_dir / "latest.json"
        if not metadata_file.exists():
            print("No checkpoint found, starting fresh.")
            return 0, 0
        with open(metadata_file, "r") as f:
            metadata = json.load(f)
        ckpt_path = metadata['last_checkpoint']
        checkpoint = torch.load(ckpt_path)
        model.load_state_dict(checkpoint['model_state'])
        optimizer.load_state_dict(checkpoint['optimizer_state'])
        print(f"Resumed from {ckpt_path}")
        return metadata['last_epoch'], metadata['last_step']

# Usage in training loop (model, dataloader, criterion, num_epochs defined elsewhere)
model = YourModel()
optimizer = torch.optim.AdamW(model.parameters())
manager = CheckpointManager(interval=500)
start_epoch, start_step = manager.load_latest(model, optimizer)

for epoch in range(start_epoch, num_epochs):
    for step, batch in enumerate(dataloader):
        if epoch == start_epoch and step < start_step:
            continue  # fast-forward past batches already trained this epoch
        # Training step
        logits = model(batch)
        loss = criterion(logits, batch['labels'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Save checkpoint every N steps
        if (step + 1) % manager.interval == 0:
            manager.save_checkpoint(model, optimizer, epoch, step + 1)
```

This code saves your model and optimizer state every 500 steps. If your instance gets preempted, you resume from the latest checkpoint - losing only the steps since the last save. That's the operational cost of using preemptibles: write once, save forever.
Managed ML Services: SageMaker vs. Vertex AI vs. Azure ML
Want someone else to manage the cluster? Cloud providers offer managed training platforms that handle orchestration, checkpointing, and scaling for you.
AWS SageMaker
SageMaker handles distributed training orchestration:
- Training Jobs: Spin up instances, run code, shut down
- Spot Instance Integration: Built-in support, automatic checkpointing for Spot interruptions
- Hyperparameter Tuning: Parallel hyperparameter search
- Pricing: You pay for the underlying EC2 instances (p5, p4d, etc.) + a small service fee (~2%)
For a p5.48xlarge running 24/7 for a month:
- EC2 cost: ~$28,750/month on-demand
- SageMaker overhead: ~$575
- Total: ~$29,325 (with Spot: ~$7,500 + overhead)
SageMaker's value isn't in cost - it's in operational simplicity. You don't manage instances; SageMaker does.
GCP Vertex AI
Vertex AI is Google's managed ML platform:
- Training: Native distributed training for TensorFlow, PyTorch, custom code
- Preemptible Native: Built-in preemptible instance support with auto-restart
- Custom Containers: Bring your own training code in a Docker container
- Pricing: You pay for the underlying Compute Engine instances + 10-15% service overhead
For the same 8 × H100 job:
- GCP A3-High preemptible cost: ~$6,048 (14 days at $2.25/GPU-hr × 8 GPUs)
- Vertex AI overhead (~10%): ~$600
- Total: ~$6,648 (dramatically cheaper due to preemptibles)
Or with on-demand + CUD:
- CUD (3-year): ~50% discount on on-demand
- Effective cost: ~$30,000/month, but much lower with multi-year commitment
Azure ML
Azure's managed platform:
- ML Compute: GPU clusters with auto-scale
- Spot Support: Built-in interruption handling
- MLflow Integration: Experiment tracking
- Pricing: Pay for underlying VMs + service cost
For NDH100v5 (8 GPU) on spot:
- VM cost: ~$2,288 (14 days at ~$6.81/hr spot)
- Azure ML overhead: ~$200
- Total: ~$2,488 (cheapest option!)
But: Azure's on-demand pricing is highest, so don't use Azure for non-interruptible work unless you're committed long-term.
Real Cost Scenario: Training a 7B LLM on 64 A100s for 2 Weeks
Let's model a realistic scenario: training your proprietary 7B parameter language model on 64 A100 80GB GPUs for 2 weeks. This is typical for fine-tuning workflows.
Ingredient Breakdown
- Training compute: 64 GPUs × 336 hours (2 weeks) = 21,504 GPU-hours
- Data egress: Assume 500GB uploaded, 2TB downloaded = $100-200
- Storage: 1TB model checkpoints + training data = $15-30/month
- Networking overhead: Distributed training, cross-region = $50-100
AWS Scenario (p4d.24xlarge)
We'd need 8 × p4d.24xlarge instances (64 GPUs total):
- Instance cost: 8 instances × $32.77/hr × 336 hours = $88,537
- Spot cost: 8 instances × $8.19/hr × 336 hours = $22,015 (~75% off on-demand)
- Interruption overhead (3% hourly rate, ~30 min recompute each): ~$1,000
- Egress + storage: ~$200
- Total with Spot: ~$23,200
GCP Scenario (a3-highgpu-8g)
We'd need 8 × a3-highgpu-8g instances:
Option A: On-Demand
- Instance cost: 8 instances × $88.49/hr × 336 hours = $238,009
- 3-year CUD discount (50% off): $119,004
- Egress + storage: ~$200
- Total: ~$119,204
Option B: Preemptible (More Realistic)
- Instance cost: 8 instances × $2.25/GPU-hr (64 GPUs) × 336 hours = $48,384
- Preemption overhead (14 restart events, each costing ~30 min recompute): ~$3,000
- Egress + storage: ~$200
- Total: ~$51,584 (but requires interrupt-tolerant job!)
Azure Scenario (NDH100v5 Spot)
We'd need 8 × NDH100v5 instances (8 GPUs each):
- Spot cost: 8 instances × $6.81/hr × 336 hours = $18,380
- Egress + storage: ~$200
- Eviction overhead (assuming 5% eviction rate): ~$1,000
- Total: ~$19,580 (absolute cheapest!)
Cost Summary
| Cloud | Strategy | 2-Week Cost |
|---|---|---|
| AWS | On-Demand | $88,737 |
| AWS | Spot (with overhead) | $23,200 |
| GCP | On-Demand + CUD | $119,204 |
| GCP | Preemptible | $51,584 |
| Azure | Spot | $19,580 |
Azure's spot pricing is unbeatable for tolerant workloads, with AWS Spot on A100s close behind. But remember: interruptions require solid engineering. GCP preemptible costs more in this scenario, but it buys H100s rather than A100s, which narrows the gap in wall-clock training time. AWS on-demand is the safest bet if you can't afford interruptions.
GPU Instance Selection Decision Tree
Here's how to pick your instance type:
```mermaid
graph TD
    A["Need GPU instances?"] -->|No| B["Use CPU instances (cheaper)"]
    A -->|Yes| C{"Workload type?"}
    C -->|Inference only| D["Use L4/L40 instances<br/>AWS G5, GCP L4, Azure GA"]
    C -->|Fine-tuning| E["Use A100 instances<br/>Cost-effective training"]
    C -->|Large-scale training| F{"Scale?"}
    F -->|Single machine<br/>8-16 GPUs| G{"Interrupt<br/>tolerant?"}
    F -->|64+ GPUs| H{"Distributed<br/>networking critical?"}
    G -->|Yes| I["GCP preemptible A3<br/>or Azure Spot<br/>75% discount"]
    G -->|No| J["AWS P4d on-demand<br/>or GCP CUD<br/>Reliable + fast"]
    H -->|Yes| K["AWS P5<br/>NVSwitch<br/>600GB/s"]
    H -->|No| L["GCP A3 or<br/>Azure ND<br/>Still fast enough"]
```

Python Cost Calculator: Customize for Your Workload
Let's build a tool you can actually use. Here's a Python script to model your specific training scenario:
```python
#!/usr/bin/env python3
"""
Cloud GPU cost calculator for ML training workloads.
Customize with your parameters and run to get per-cloud cost estimates.
"""
from dataclasses import dataclass
from typing import Dict

@dataclass
class InstanceConfig:
    name: str
    provider: str
    gpu_count: int
    gpu_memory_gb: int
    hourly_on_demand: float
    hourly_spot: float
    inter_gpu_bandwidth_gbps: int

# Define instance configurations (Feb 2026 pricing)
INSTANCES = {
    'aws_p5.48xlarge': InstanceConfig(
        name='AWS p5.48xlarge',
        provider='AWS',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=39.80,
        hourly_spot=9.95,
        inter_gpu_bandwidth_gbps=600
    ),
    'aws_p4d.24xlarge': InstanceConfig(
        name='AWS p4d.24xlarge',
        provider='AWS',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=32.77,
        hourly_spot=8.19,
        inter_gpu_bandwidth_gbps=600
    ),
    'gcp_a3_highgpu': InstanceConfig(
        name='GCP a3-highgpu-8g',
        provider='GCP',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=88.49,
        hourly_spot=2.25 * 8,  # $2.25 per GPU-hour, 8 GPUs
        inter_gpu_bandwidth_gbps=3200
    ),
    'azure_ndh100v5': InstanceConfig(
        name='Azure NDH100v5 (8 GPU)',
        provider='Azure',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=55.84,
        hourly_spot=6.81,  # ~$6.81/hr for the whole 8-GPU instance
        inter_gpu_bandwidth_gbps=200
    ),
}

def calculate_job_cost(
    instance_config: InstanceConfig,
    num_instances: int,
    hours: float,
    use_spot: bool = False,
    egress_gb: float = 2000,
    storage_gb: float = 1000,
    monthly_storage_cost: float = 23,
    interruption_rate: float = 0.03
) -> Dict[str, float]:
    """
    Calculate total cost for a distributed training job.

    Args:
        instance_config: Instance configuration
        num_instances: Number of instances to run
        hours: Training duration in hours
        use_spot: Whether to use spot/preemptible instances
        egress_gb: Data egress in GB
        storage_gb: Storage needed in GB
        monthly_storage_cost: Cost per 1000 GB per month
        interruption_rate: Hourly interruption rate for spot instances

    Returns:
        Dictionary with cost breakdown
    """
    hourly_rate = instance_config.hourly_spot if use_spot else instance_config.hourly_on_demand

    # Compute cost
    compute_cost = num_instances * hourly_rate * hours

    # Interruption penalty, assuming ~1 hour of recompute per interruption
    interruption_penalty = 0
    if use_spot and interruption_rate > 0:
        expected_interruptions = hours * interruption_rate
        interruption_penalty = expected_interruptions * num_instances * hourly_rate

    # Data egress cost (~$0.10/GB)
    egress_cost = egress_gb * 0.10

    # Storage cost
    duration_months = hours / 730  # Approximate hours per month
    storage_cost = (storage_gb / 1000) * monthly_storage_cost * duration_months

    # Total
    total_cost = compute_cost + interruption_penalty + egress_cost + storage_cost
    return {
        'compute_cost': compute_cost,
        'interruption_penalty': interruption_penalty,
        'egress_cost': egress_cost,
        'storage_cost': storage_cost,
        'total_cost': total_cost,
        'cost_per_gpu_hour': total_cost / (instance_config.gpu_count * num_instances * hours)
    }

# Example: 64 GPU training job for 2 weeks
print("=" * 80)
print("SCENARIO: 64-GPU Training Job (2 weeks)")
print("=" * 80)

# AWS P4d: 8 instances (64 GPUs)
aws_p4d_result = calculate_job_cost(
    INSTANCES['aws_p4d.24xlarge'],
    num_instances=8,
    hours=336,
    use_spot=True,
    interruption_rate=0.03
)
print("\nAWS p4d.24xlarge (Spot):")
print(f"  Compute: ${aws_p4d_result['compute_cost']:,.2f}")
print(f"  Interruption penalty: ${aws_p4d_result['interruption_penalty']:,.2f}")
print(f"  Egress: ${aws_p4d_result['egress_cost']:,.2f}")
print(f"  Total: ${aws_p4d_result['total_cost']:,.2f}")
print(f"  $/GPU-hour: ${aws_p4d_result['cost_per_gpu_hour']:.4f}")

# GCP A3: 8 instances (64 GPUs), preemptible
gcp_result = calculate_job_cost(
    INSTANCES['gcp_a3_highgpu'],
    num_instances=8,
    hours=336,
    use_spot=True,
    interruption_rate=0.04  # roughly 1 event per day
)
print("\nGCP a3-highgpu-8g (Preemptible):")
print(f"  Compute: ${gcp_result['compute_cost']:,.2f}")
print(f"  Interruption penalty: ${gcp_result['interruption_penalty']:,.2f}")
print(f"  Egress: ${gcp_result['egress_cost']:,.2f}")
print(f"  Total: ${gcp_result['total_cost']:,.2f}")
print(f"  $/GPU-hour: ${gcp_result['cost_per_gpu_hour']:.4f}")

# Azure ND: 8 instances (64 GPUs), spot
azure_result = calculate_job_cost(
    INSTANCES['azure_ndh100v5'],
    num_instances=8,
    hours=336,
    use_spot=True,
    interruption_rate=0.05  # ~5% per hour
)
print("\nAzure NDH100v5 (Spot):")
print(f"  Compute: ${azure_result['compute_cost']:,.2f}")
print(f"  Interruption penalty: ${azure_result['interruption_penalty']:,.2f}")
print(f"  Egress: ${azure_result['egress_cost']:,.2f}")
print(f"  Total: ${azure_result['total_cost']:,.2f}")
print(f"  $/GPU-hour: ${azure_result['cost_per_gpu_hour']:.4f}")

print("\n" + "=" * 80)
print("WINNER (for interrupt-tolerant workloads): Azure Spot")
print("RUNNER-UP (on cost): AWS P4d Spot")
print("SAFEST (no interruptions): AWS P4d On-Demand or GCP with CUD")
print("=" * 80)
```

Run this script and customize the parameters. Plug in your actual:
- Training duration
- Number of GPUs needed
- Data transfer volumes
- Interruption tolerance
Decision Framework: Which Cloud for Your Workload?
Choose AWS if:
- You need absolute reliability (no interruptions)
- You're training at massive scale (100+ GPUs) and need NVSwitch
- You already use AWS for infrastructure (ecosystem consistency)
- You prefer established, proven MLOps tooling
Choose GCP if:
- You can tolerate 24-hour preemption cycles (preemptible instances terminate within 24 hours, so plan on daily restarts)
- You want the cheapest total cost of ownership with managed ML
- You need committed use discounts for predictable, long-term workloads
- Your team loves Python and TensorFlow (native integration)
Choose Azure if:
- Cost is your only variable (spot pricing unbeatable)
- You're already deep in the Microsoft ecosystem
- You need on-demand reliability AND spot availability (Azure's stronger reserved instance ecosystem helps)
The Operational Reality: What Pricing Tables Don't Tell You
Let me be honest about something the pricing tables obscure: the total cost of ownership goes way beyond hourly instance rates. You're paying for networking, storage, data transfer, and most importantly, operational overhead.
When you use spot or preemptible instances at massive scale, you're introducing complexity. You need robust checkpointing. You need monitoring and alerting for preemption events. You need playbooks for automated restart. For a team of five, managing one spot instance is trivial. For a team managing fifty concurrent spot jobs across multiple regions, it becomes a full-time operational burden. We've seen teams save twenty percent on compute costs by switching to spot, then lose fifty percent of that savings to the engineer-hours spent managing interruptions and debugging failures.
Data transfer costs are often overlooked. If you're training a large model, you're moving terabytes of data in and out of cloud storage. AWS charges about $0.13 per GB for egress. That adds up fast. GCP is slightly cheaper at $0.12. Azure is $0.10, which matters if you're moving dozens of terabytes. For a 2-week training job moving 10TB out, you're looking at $1,200 to $1,300 in egress costs alone. This isn't reflected in the comparison tables, but it's very real.
Storage costs also compound. You're not just paying for the model checkpoints you save - you're paying for temporary storage during training, intermediate artifacts, tensorboard logs, and everything else that gets written to disk. A petabyte-month of storage at AWS S3 standard is about $23,000. For a large training job generating lots of diagnostic data, this matters.
Then there's the human factor. Running GPU infrastructure requires someone on call. Someone needs to monitor utilization, catch runaway jobs, debug failures. That's not free. Managed services like SageMaker and Vertex AI offload this burden to the cloud provider, which is why they're worth the overhead cost even though they're technically more expensive per compute unit.
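Those side costs are easy to fold into an estimate. A small sketch using the per-GB egress figures above (region-dependent, treat them as assumptions) and ~$23 per TB-month for standard object storage:

```python
# Non-compute costs for a training job: egress + object storage.
# Per-GB egress rates are the approximate figures quoted above.
EGRESS_PER_GB = {"AWS": 0.13, "GCP": 0.12, "Azure": 0.10}

def side_costs(provider, egress_tb, storage_tb, weeks,
               storage_per_tb_month=23.0):
    egress = egress_tb * 1000 * EGRESS_PER_GB[provider]        # TB -> GB
    storage = storage_tb * storage_per_tb_month * (weeks * 7 / 30)
    return egress + storage

# 2-week job moving 10 TB out, keeping 5 TB of checkpoints and logs
for provider in EGRESS_PER_GB:
    print(f"{provider}: ~${side_costs(provider, 10, 5, weeks=2):,.0f}")
```

At this scale egress dominates storage by an order of magnitude, which is why keeping data in-region (and in-cloud) is usually the first TCO optimization worth making.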
Multicloud Strategies: Hedging Your Bets
Smart organizations don't bet everything on one cloud provider. They deploy across AWS, GCP, and Azure strategically. Here's how we see mature teams approach this.
For training workloads, they use GCP preemptibles as the primary compute because of cost, with AWS P4d instances as a fallback for jobs that can't tolerate interruptions. For inference, they use AWS on-demand for latency-sensitive applications, GCP committed discounts for baseline capacity, and Azure spot for bursty workloads. This diversification protects against regional outages, handles different workload characteristics with appropriately priced hardware, and creates negotiating leverage with cloud providers.
The downside is operational complexity. Your CI/CD needs to be cloud-agnostic. Your container images need to work across AWS, GCP, and Azure. Your monitoring and logging need to aggregate across providers. Your data replication needs to account for geo-distribution. But the benefits - cost optimization, reliability, and strategic flexibility - often justify the engineering investment.
Wrapping Up: The Real Cost of ML at Scale
You've now got the full picture. The cloud GPU market is competitive, pricing evolves monthly, and the "best" choice depends entirely on your tolerance for interruptions, your duration commitment, and your operational maturity.
Let's synthesize the key insights. AWS excels at scale with NVSwitch providing unmatched inter-GPU bandwidth, making it the clear choice for ultra-large training clusters where every percentage point of efficiency matters. The P5 infrastructure represents the frontier of distributed GPU performance, though that leadership comes at a cost that newer teams might not be able to justify. GCP offers a fundamentally different value proposition, particularly through preemptible instances and committed use discounts that can drive total cost of ownership below AWS for the right workload profile. If your training job can tolerate daily interruptions, GCP's economics are almost unbeatable. Azure rounds out the comparison with the most aggressive spot pricing, making it ideal for cost-optimized workloads run by teams with mature interruption-handling infrastructure.
Key takeaways:
- AWS P5 with NVSwitch is king for ultra-large-scale distributed training requiring high inter-GPU bandwidth.
- GCP preemptible A3 offers the best balance of cost and predictability for tolerant workloads.
- Azure spot is unbeatable on raw price if you can handle frequent interruptions.
- Managed ML services (SageMaker, Vertex AI, Azure ML) add roughly 2-15% overhead but handle orchestration, checkpointing, and auto-scaling for you.
- Checkpoint-restart patterns are non-negotiable if you use spot/preemptible instances.
- Data transfer and storage costs are often larger than you expect - factor them into your TCO analysis.
- Multicloud strategies hedge your bets and optimize for specific workload types.
The winning strategy depends on your situation. Teams just starting with GPU training should probably begin with GCP preemptibles or Azure spot to minimize spend while they learn the operational patterns. As you scale and develop operational maturity, AWS on-demand or committed discounts become more cost-effective. Long-term, mature teams often use all three clouds, routing workloads intelligently to minimize total cost.
Use the Python calculator to model your specific workload, but don't stop there. Run small pilot jobs on each cloud. Measure actual costs, not just hourly rates. Account for data transfer, storage, and operational overhead. Benchmark end-to-end training time, not just compute time. And remember: the cheapest GPU isn't the one that costs the least per hour - it's the one that trains your model fastest while staying within budget, with operational overhead you can actually manage.
The cloud GPU market will continue evolving. Prices shift. New instance types launch. Your job is to understand the architecture deeply enough that you can adapt to changes, not just chase whoever has the lowest advertised rate this month.
Negotiating with Cloud Providers: Getting Better Rates
Most teams accept published pricing without negotiation, leaving hundreds of thousands on the table. Cloud providers have substantial negotiating room, especially for committed workloads. If you're planning to spend one million dollars on GPU training over the next year, AWS will likely give you 20-30 percent off their published rates in exchange for a commitment. GCP's committed use discounts are published, but they can negotiate beyond that if your volume is large enough. Azure has less flexibility, but spot instance pricing for large volumes can be aggressively discounted.
The negotiation starts with understanding your committed spend. If you can truthfully say "we're planning to train models 24/7 for the next year on 64 H100s," that's leverage. Cloud providers have capacity utilization targets - they'd rather you commit to that capacity at a discount than leave it empty. Get proposals from all three clouds, share each quote with the competing providers, and play them off each other. This is not unethical; it's business. The standard process is: declare your intent to commit to a volume, get initial quotes, negotiate, get revised quotes, decide.
For organizations larger than mid-market, this can represent millions of dollars in savings. A large AI research lab training dozens of models might save three to four million dollars per year by negotiating effectively. This is a critical conversation to have with your procurement team.
The Strategic Multicloud Approach
Sophisticated organizations use multiple clouds strategically. The typical pattern: primary training on the cloud with the best raw price-performance (usually GCP for preemptibles, AWS for on-demand), fallback capacity on a second cloud for overflow, inference on the cloud with the lowest serving cost (often different from training). This requires some operational complexity - your training code needs to work across clouds, your data needs replication strategies, your monitoring needs to aggregate across clouds. But the flexibility is worth it. You're never held hostage by a single cloud provider's capacity or price. If AWS has a regional outage, you automatically failover to GCP. If GCP raises prices, you shift workloads to Azure. This resilience and negotiating leverage are valuable.
For truly large organizations, the math often works out to a three-cloud strategy where each cloud hosts a different part of your workload. Training on GCP preemptibles (cheapest), inference on AWS (best networking for end-users), and analytics on Azure (deep integration with enterprise tools). This requires more infrastructure investment upfront, but the cost savings and operational resilience compound over time.