Cloud GPU Options Compared: AWS vs GCP vs Azure for ML
You've got a massive machine learning project ahead. Maybe you're training a 7B parameter language model. Maybe you need to fine-tune a vision transformer on your proprietary dataset. Either way, you're staring at the cloud ecosystem and wondering: Where should I actually run this thing?
AWS, Google Cloud, and Microsoft Azure all offer serious GPU horsepower. The prices have shifted dramatically in the last year. Spot instance strategies that worked in 2024 might leave you broke in 2026. And distributed training across GPUs? That's where networking performance becomes your secret weapon.
This article cuts through the marketing noise. We'll compare actual hardware, actual prices (as of February 2026), and show you a Python calculator so you can model your own costs. By the end, you'll know exactly which cloud fits your workload.
Table of Contents
- The GPU Landscape: What You're Actually Getting
- AWS P-Series: The Workhorses
- GCP A3 Series: The Dark Horse
- Azure GPU Options: The Expensive but Reliable Choice
- The Comparison Table: Hard Numbers
- GPU Networking: The Hidden Performance Layer
- AWS: NVSwitch (P5 Only)
- GCP: Custom Ethernet + SmartNIC
- Azure: Ethernet + InfiniBand
- Spot / Preemptible Instances: Risk vs. Reward
- AWS Spot Interruption Rates
- GCP Preemptible: The 24-Hour Rule
- Azure Spot: The Cheapest Option
- Checkpoint-Restart Pattern (Python)
- Managed ML Services: SageMaker vs. Vertex AI vs. Azure ML
- AWS SageMaker
- GCP Vertex AI
- Azure ML
- Real Cost Scenario: Training a 7B LLM on 64 A100s for 2 Weeks
- Ingredient Breakdown
- AWS Scenario (p4d.24xlarge)
- GCP Scenario (a3-highgpu-8g)
- Azure Scenario (NDH100v5 Spot)
- Cost Summary
- GPU Instance Selection Decision Tree
- Python Cost Calculator: Customize for Your Workload
- Decision Framework: Which Cloud for Your Workload?
- The Operational Reality: What Pricing Tables Don't Tell You
- Multicloud Strategies: Hedging Your Bets
- Wrapping Up: The Real Cost of ML at Scale
- Negotiating with Cloud Providers: Getting Better Rates
- The Strategic Multicloud Approach
The GPU Landscape: What You're Actually Getting
Let me be direct: not all GPUs are created equal. The cloud providers each have a lineup, and they've optimized them for different scenarios. Understanding the hardware hierarchy is crucial because upgrading from an L4 to an A100 isn't just a 2x performance difference - it's more like 10-15x for large-scale training.
AWS P-Series: The Workhorses
AWS offers three main GPU instance families for ML, each optimized for different compute density:
P5 Series (NVIDIA H100 Tensor Core GPUs):
- p5.4xlarge: 4 × H100 GPUs, 144GB GPU memory, $6.88/hour on-demand
- p5.48xlarge: 8 × H100 GPUs (NVSwitch interconnect), 1,152GB CPU memory, $39.80/hour as of January 2026
P4d Series (NVIDIA A100 Tensor Core GPUs):
- p4d.24xlarge: 8 × A100 (80GB), 768GB CPU memory, $32.77/hour
- NVSwitch provides 600 GB/s GPU-to-GPU bandwidth
G5 Series (NVIDIA A10G GPUs):
- g5.12xlarge: 4 × A10G GPUs, $7.32/hour
- Best for inference and fine-tuning, not large-scale training
Key takeaway: AWS P5 instances just dropped 45% in June 2025, making them competitive again after years of premium pricing. This is significant - it reshuffles the entire economics of distributed training on AWS.
The pricing history matters because cloud GPU costs shift. A year ago, GCP was significantly cheaper. Now AWS is competitive again. Plan your infrastructure expecting these shifts.
GCP A3 Series: The Dark Horse
Google Cloud's A3 instances use the same H100 GPUs as AWS but in different packaging. Google's infrastructure engineering gives them unique advantages in networking reliability, even without NVSwitch.
A3 High-GPU (a3-highgpu-8g):
- 8 × H100 80GB GPUs, $88.49/hour on-demand
- Fast interconnect (3200 Gbps effective)
- No NVSwitch, but Google's custom networking is exceptionally reliable
A3 Mega (a3-megagpu-8g):
- 16 × H100 GPUs (two instances networked), extreme scale
- $176.98/hour base, but often used with Committed Use Discounts (CUD)
Bonus: GCP's preemptible instances cost $2.25/GPU-hour for A3-High. That's roughly 80% off the on-demand per-GPU rate. The catch? Interruption risk, but more on that later. For many workloads, this is the best-kept secret in cloud GPU pricing.
Key takeaway: GCP's committed use discounts can drive long-term costs below AWS. Preemptibles are a steal if your workload tolerates interruptions. This changes the entire ROI calculation for long-running training jobs.
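One way to see why the commitment math matters: a committed-use discount bills you for the committed capacity whether you use it or not, so it only wins above a utilization threshold equal to the discount itself. A minimal sketch, using the on-demand rate above and assuming a 50% three-year CUD (the exact discount is an assumption):

```python
# Back-of-envelope: when does a committed-use discount (CUD) beat on-demand?
# Rates are illustrative, taken from the figures above; 50% is an assumed
# 3-year CUD discount.
ON_DEMAND_PER_GPU_HR = 88.49 / 8        # a3-highgpu-8g, per GPU
CUD_DISCOUNT = 0.50                     # assumed 3-year CUD
cud_rate = ON_DEMAND_PER_GPU_HR * (1 - CUD_DISCOUNT)

def monthly_cost(rate_per_gpu_hr, gpus, utilization=1.0, hours=730):
    """Monthly cost; a CUD bills committed capacity at utilization=1.0."""
    return rate_per_gpu_hr * gpus * hours * utilization

# A CUD is billed whether or not the GPUs are busy, so its effective
# per-used-hour price rises as utilization falls.
for util in (1.0, 0.75, 0.5, 0.25):
    cud = monthly_cost(cud_rate, gpus=8)                        # billed in full
    on_demand = monthly_cost(ON_DEMAND_PER_GPU_HR, 8, util)     # billed when used
    better = "CUD" if cud < on_demand else "on-demand"
    print(f"{util:.0%} utilization: CUD ${cud:,.0f} vs on-demand ${on_demand:,.0f} -> {better}")
```

With a 50% discount the break-even sits at 50% utilization: below that, plain on-demand was cheaper, and preemptibles shift the math further still.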
Azure GPU Options: The Expensive but Reliable Choice
Azure's GPU lineup includes:
ND H100 v5 Series (NDH100v5):
- Up to 8 × H100 GPUs per instance
- Horovod/MPI-optimized networking
- Roughly $6.98/GPU-hour base in East US
- Spot instances available at ~$6.81/hour for the 8-GPU instance - roughly $0.85/GPU-hour (huge discount)
NC-series (NCv3, NC_A100_v4):
- A100 and older V100 options
- More cost-effective for smaller workloads
- Less bandwidth between GPUs
NC-A100-v4:
- 4 or 8 × A100 GPUs per instance
- Competitive on price with older GPU hardware
Key takeaway: Azure's spot instances are criminally cheap, but on-demand pricing remains the highest among the three. Use spots or committed discounts. Full-price on-demand on Azure is economically unsustainable for serious GPU workloads.
The Comparison Table: Hard Numbers
Here's what you're actually paying (as of February 2026, USD, us-east/us-central-1 regions). These prices shift quarterly, so treat this as directional:
| Provider | Instance Type | GPUs | GPU Memory | On-Demand/hr | Spot/Preemptible/hr | 30-day Cost (24/7) |
|---|---|---|---|---|---|---|
| AWS | p5.48xlarge | 8 × H100 | 640GB | $39.80 | ~$9.95 | $28,750 |
| AWS | p4d.24xlarge | 8 × A100 (80GB) | 640GB | $32.77 | ~$8.19 | $23,600 |
| GCP | a3-highgpu-8g | 8 × H100 | 640GB | $88.49 | $18.00 ($2.25/GPU-hr preemptible) | $63,873 |
| GCP | a3-highgpu-8g (preemptible only) | 8 × H100 | 640GB | N/A | $18.00 | $12,960 |
| Azure | NDH100v5 (8 GPU) | 8 × H100 | 640GB | ~$55.84 | ~$6.81 | $40,203 |
| Azure | NC A100 v4 | 4 × A100 | 320GB | ~$21.92 | ~$2.68 | $15,784 |
Notice something? GCP's on-demand price looks absurd - but their CUDs and preemptibles flip the economics. Azure spot is unbeatable for interruption-tolerant workloads. AWS offers the middle ground with good pricing and ecosystem maturity.
The real decision isn't which cloud is "cheapest" - it's which cloud is cheapest for your specific workload. A 2-week training job? GCP preemptibles win. Long-term committed capacity? GCP CUDs or Azure reserved instances. Instant need with no interruption tolerance? AWS on-demand or GCP on-demand.
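Because GPU counts differ across rows (the NC A100 v4 entry has four GPUs, the rest eight), per-GPU-hour is the cleaner way to compare. A quick normalization of the table's rates (the GCP spot figure assumes $2.25/GPU-hr preemptible):

```python
# Normalize the comparison table to $/GPU-hour (rates from the table above).
instances = {
    # name: (gpu_count, on_demand_per_hr, spot_or_preemptible_per_hr)
    "AWS p5.48xlarge":   (8, 39.80, 9.95),
    "AWS p4d.24xlarge":  (8, 32.77, 8.19),
    "GCP a3-highgpu-8g": (8, 88.49, 2.25 * 8),   # preemptible, $2.25/GPU-hr
    "Azure NDH100v5":    (8, 55.84, 6.81),
    "Azure NC A100 v4":  (4, 21.92, 2.68),
}

for name, (gpus, od, spot) in instances.items():
    print(f"{name}: on-demand ${od / gpus:.2f}/GPU-hr, "
          f"spot ${spot / gpus:.2f}/GPU-hr "
          f"({1 - spot / od:.0%} discount)")
```

Normalized this way, AWS spot lands around a 75% discount while the GCP preemptible and Azure spot rates sit closer to 80-88% off their own on-demand prices.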
GPU Networking: The Hidden Performance Layer
You've got 8 H100s in your instance. Great. But how fast can they talk to each other?
This matters enormously for distributed training. A 7B language model distributed across 8 GPUs needs to synchronize gradients constantly. If your inter-GPU communication is slow, you're leaving compute on the table. A 10% networking penalty on a $40/hour instance costs you $4/hour in wasted capacity. Multiply that by 30 days and 5 instances, and you've lost thousands in efficiency.
AWS: NVSwitch (P5 Only)
P5 instances include NVIDIA's NVSwitch interconnect, delivering:
- 600 GB/s full-duplex bandwidth between any two GPUs
- All-to-all connectivity (no bottlenecks)
- Seamless scaling to 8+ GPUs
This is why P5 dominates for massive distributed training. NVSwitch is the gold standard. When you need to coordinate gradients across dozens of GPUs with minimal latency, NVSwitch's hardware guarantees are irreplaceable.
GCP: Custom Ethernet + SmartNIC
GCP doesn't use NVSwitch, but they've built something equally smart:
- 3200 Gbps effective inter-GPU bandwidth
- Custom networking stack (low-latency)
- Horovod integration (native MPI support)
In practice, GCP trades raw bandwidth for lower latency and more predictable performance. For most LLM training, the difference is negligible. You'll see 5-10% performance differences in specific workloads, not 50%.
Azure: Ethernet + InfiniBand
Azure uses InfiniBand (200 Gbps) or Ethernet depending on the region:
- InfiniBand: 200 Gbps (better), available in limited regions
- Ethernet: 25-100 Gbps (still fast, but slower than AWS/GCP)
Azure's networking is reliable but slower. This affects distributed training efficiency, especially at scale. A 64-GPU job on Azure might take 12% longer than on AWS, not because of compute, but because of gradient synchronization overhead.
Bottom line: For 64-GPU training clusters, AWS P5 wins on raw networking. GCP is a close second. Azure requires more careful placement and job scheduling. If you're training a 7B model on 8 GPUs, these differences barely matter. At 64+ GPUs, they become significant.
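To put rough numbers on that, here's a crude ring all-reduce estimate for a 7B model with fp16 gradients. It treats each quoted figure as the bottleneck link speed and ignores compute overlap, topology (the NVSwitch figure is intra-node only), and gradient compression - directional only:

```python
# Crude ring all-reduce estimate: per sync, each GPU moves
# 2*(N-1)/N * payload bytes across the slowest link. Ignores compute
# overlap, topology, and compression - directional numbers only.
def allreduce_seconds(n_params, bytes_per_param, n_gpus, link_gbps):
    payload = n_params * bytes_per_param              # gradient bytes
    traffic = 2 * (n_gpus - 1) / n_gpus * payload     # ring all-reduce volume
    link_bytes_per_s = link_gbps * 1e9 / 8            # Gbit/s -> bytes/s
    return traffic / link_bytes_per_s

PARAMS = 7e9  # 7B model, fp16 gradients (2 bytes each)
links = {
    "NVSwitch-class (600 GB/s)": 600 * 8,   # GB/s -> Gbit/s
    "GCP-class (3200 Gbps)": 3200,
    "InfiniBand (200 Gbps)": 200,
}
for label, gbps in links.items():
    t = allreduce_seconds(PARAMS, 2, n_gpus=64, link_gbps=gbps)
    print(f"{label}: ~{t:.2f}s per full gradient sync")
```

Even this toy model shows the shape of the problem: at 200 Gbps a full gradient sync takes on the order of a second, versus tens of milliseconds on NVSwitch-class bandwidth, which is exactly where the "12% longer on Azure" kind of gap comes from.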
Spot / Preemptible Instances: Risk vs. Reward
The ultimate money hack? Spot instances. These are spare cloud capacity sold at 60-90% discount. The trade-off: they can be interrupted with minimal warning. But with proper checkpointing, interruptions become just another operational cost.
AWS Spot Interruption Rates
AWS publishes real interruption metrics. Here's what's typical:
- p5.4xlarge: 2-5% hourly interruption rate
- p4d.24xlarge: 1-3% hourly interruption rate
- g5.12xlarge: 0.5-2% hourly interruption rate
For 24/7 training, a 3% hourly rate means you'll lose the instance about 22 times per month (0.03 × 720 hours). Oof. That's 22 restarts. With proper checkpointing, each restart costs you maybe 30 minutes of recomputation (the work since the last checkpoint). So you're losing about 11 hours per month to interruptions - roughly 1.5% of your compute capacity.
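That arithmetic generalizes. A tiny model, treating each hour as an independent preemption coin flip (a simplification - real interruption rates vary by region, instance type, and capacity):

```python
# Expected spot-interruption overhead, modeling preemption as a fixed,
# independent per-hour probability (a simplification).
def interruption_overhead(hourly_prob, hours, recompute_hours=0.5):
    expected_events = hourly_prob * hours
    lost = expected_events * recompute_hours    # recompute after each event
    return expected_events, lost, lost / hours

events, lost, frac = interruption_overhead(0.03, hours=30 * 24)
print(f"~{events:.0f} interruptions/month, ~{lost:.1f} h lost ({frac:.1%} of capacity)")
```

Plug in your own observed interruption rate and checkpoint interval: if the lost-capacity fraction stays in the low single digits, the spot discount dwarfs it.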
GCP Preemptible: The 24-Hour Rule
GCP preemptible instances come with a built-in ceiling rather than a safety net: they always terminate within 24 hours of starting, and they can also be preempted earlier with 30 seconds notice.
This changes the game, because the failure mode is bounded and roughly predictable. You can:
- Request a preemptible instance
- Plan on a restart at least once every 24 hours
- Set up auto-restart with checkpointing
- Rinse and repeat
For a 2-week training job, you'd expect at least 14 preemption events (one per day). Manageable, especially with proper checkpointing. And at $2.25/GPU-hour vs the ~$11/GPU-hour on-demand rate, that roughly 80% discount more than pays for the operational complexity.
Azure Spot: The Cheapest Option
Azure's spot pricing is unbeatable - roughly $0.85/GPU-hour for H100 (~$6.81/hour for an 8-GPU NDH100v5). But the tradeoff is similar to AWS: interruptions can happen anytime, sometimes with little warning.
Azure's mitigation? Spot eviction notices give you ~30 seconds before termination - shorter than AWS's 2-minute interruption warning, but workable. Set up graceful shutdown with proper checkpoint saving, and you can handle it.
Checkpoint-Restart Pattern (Python)
Here's how to survive preemptions. This pattern saves your training state every N steps, so interruptions cost you at most N steps of recomputation:
```python
import torch
import json
from pathlib import Path

class CheckpointManager:
    def __init__(self, checkpoint_dir="./checkpoints", interval=500):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(exist_ok=True)
        self.interval = interval

    def save_checkpoint(self, model, optimizer, epoch, step):
        """Save model, optimizer state, and metadata."""
        checkpoint = {
            'epoch': epoch,
            'step': step,
            'model_state': model.state_dict(),
            'optimizer_state': optimizer.state_dict(),
        }
        ckpt_path = self.checkpoint_dir / f"checkpoint-epoch{epoch}-step{step}.pt"
        torch.save(checkpoint, ckpt_path)
        # Keep metadata for resumption
        metadata = {
            'last_epoch': epoch,
            'last_step': step,
            'last_checkpoint': str(ckpt_path)
        }
        with open(self.checkpoint_dir / "latest.json", "w") as f:
            json.dump(metadata, f)
        print(f"Checkpoint saved: {ckpt_path}")

    def load_latest(self, model, optimizer):
        """Resume from the most recent checkpoint."""
        metadata_file = self.checkpoint_dir / "latest.json"
        if not metadata_file.exists():
            print("No checkpoint found, starting fresh.")
            return 0, 0
        with open(metadata_file, "r") as f:
            metadata = json.load(f)
        ckpt_path = metadata['last_checkpoint']
        checkpoint = torch.load(ckpt_path)
        model.load_state_dict(checkpoint['model_state'])
        optimizer.load_state_dict(checkpoint['optimizer_state'])
        print(f"Resumed from {ckpt_path}")
        return metadata['last_epoch'], metadata['last_step']

# Usage in training loop (model, dataloader, criterion, num_epochs defined elsewhere)
model = YourModel()
optimizer = torch.optim.AdamW(model.parameters())
manager = CheckpointManager(interval=500)
start_epoch, start_step = manager.load_latest(model, optimizer)

for epoch in range(start_epoch, num_epochs):
    for step, batch in enumerate(dataloader):
        if epoch == start_epoch and step < start_step:
            continue  # fast-forward past batches already trained this epoch
        # Training step
        logits = model(batch)
        loss = criterion(logits, batch['labels'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Save checkpoint every N steps
        if (step + 1) % manager.interval == 0:
            manager.save_checkpoint(model, optimizer, epoch, step + 1)
```

This code saves your model and optimizer state every 500 steps. If your instance gets preempted, you resume from the latest checkpoint - losing only the steps since the last save. That's the operational cost of using preemptibles: write once, save forever.
Managed ML Services: SageMaker vs. Vertex AI vs. Azure ML
Want someone else to manage the cluster? Cloud providers offer managed training platforms that handle orchestration, checkpointing, and scaling for you.
AWS SageMaker
SageMaker handles distributed training orchestration:
- Training Jobs: Spin up instances, run code, shut down
- Spot Instance Integration: Built-in support, automatic checkpointing for Spot interruptions
- Hyperparameter Tuning: Parallel hyperparameter search
- Pricing: You pay for the underlying EC2 instances (p5, p4d, etc.) + a small service fee (~2%)
For a p5.48xlarge running 24/7 for a month:
- EC2 cost: ~$28,750/month on-demand
- SageMaker overhead: ~$575
- Total: ~$29,325 (with Spot: ~$7,500 + overhead)
SageMaker's value isn't in cost - it's in operational simplicity. You don't manage instances; SageMaker does.
GCP Vertex AI
Vertex AI is Google's managed ML platform:
- Training: Native distributed training for TensorFlow, PyTorch, custom code
- Preemptible Native: Built-in preemptible instance support with auto-restart
- Custom Containers: Bring your own training code in a Docker container
- Pricing: You pay for the underlying Compute Engine instances + 10-15% service overhead
For the same 8 × H100 job:
- GCP A3-High preemptible cost: ~$6,048 (14 days at $2.25/GPU-hr × 8 GPUs)
- Vertex AI overhead (~10%): ~$600
- Total: ~$6,648 (dramatically cheaper due to preemptibles)
Or with on-demand + CUD:
- CUD (3-year): ~50% discount on on-demand
- Effective cost: ~$30,000/month, but much lower with multi-year commitment
Azure ML
Azure's managed platform:
- ML Compute: GPU clusters with auto-scale
- Spot Support: Built-in interruption handling
- MLflow Integration: Experiment tracking
- Pricing: Pay for underlying VMs + service cost
For NDH100v5 (8 GPU) on spot:
- VM cost: ~$2,288 (14 days at ~$6.81/hr spot)
- Azure ML overhead: ~$200
- Total: ~$2,488 (cheapest option!)
But: Azure's on-demand pricing is highest, so don't use Azure for non-interruptible work unless you're committed long-term.
Real Cost Scenario: Training a 7B LLM on 64 A100s for 2 Weeks
Let's model a realistic scenario: training your proprietary 7B parameter language model on 64 A100 80GB GPUs for 2 weeks. This is typical for fine-tuning workflows.
Ingredient Breakdown
- Training compute: 64 GPUs × 336 hours (2 weeks) = 21,504 GPU-hours
- Data egress: Assume 500GB uploaded, 2TB downloaded = $100-200
- Storage: 1TB model checkpoints + training data = $15-30/month
- Networking overhead: Distributed training, cross-region = $50-100
AWS Scenario (p4d.24xlarge)
We'd need 8 × p4d.24xlarge instances (64 GPUs total):
- Instance cost: 8 instances × $32.77/hr × 336 hours = $88,537
- Spot cost: 8 instances × $8.19/hr × 336 hours = $22,015 (~75% off on-demand)
- Interruption overhead (3% hourly rate, ~30 min recompute each): ~$1,000
- Egress + storage: ~$200
- Total with Spot: ~$23,200
GCP Scenario (a3-highgpu-8g)
We'd need 8 × a3-highgpu-8g instances:
Option A: On-Demand
- Instance cost: 8 instances × $88.49/hr × 336 hours = $238,009
- 3-year CUD discount (50% off): $119,004
- Egress + storage: ~$200
- Total: ~$119,204
Option B: Preemptible (More Realistic)
- Instance cost: 8 instances × $2.25/GPU-hr (64 GPUs) × 336 hours = $48,384
- Preemption overhead (14 restart events, each costing ~30 min recompute): ~$3,000
- Egress + storage: ~$200
- Total: ~$51,584 (but requires interrupt-tolerant job!)
Azure Scenario (NDH100v5 Spot)
We'd need 8 × NDH100v5 instances (8 GPUs each):
- Spot cost: 8 instances × $6.81/hr × 336 hours = $18,380
- Egress + storage: ~$200
- Eviction overhead (assuming 5% eviction rate): ~$1,000
- Total: ~$19,580 (absolute cheapest!)
Cost Summary
| Cloud | Strategy | 2-Week Cost |
|---|---|---|
| AWS | On-Demand | $88,737 |
| AWS | Spot (with overhead) | $23,200 |
| GCP | On-Demand + CUD | $119,204 |
| GCP | Preemptible | $51,584 |
| Azure | Spot | $19,580 |
Azure's spot pricing is unbeatable for tolerant workloads, with AWS Spot on A100s close behind. But remember: interruptions require solid engineering. GCP preemptible costs more in this scenario, but it buys H100s rather than A100s, which narrows the gap in wall-clock training time. AWS on-demand is the safest bet if you can't afford interruptions.
GPU Instance Selection Decision Tree
Here's how to pick your instance type:
```mermaid
graph TD
    A["Need GPU instances?"] -->|No| B["Use CPU instances (cheaper)"]
    A -->|Yes| C{"Workload type?"}
    C -->|Inference only| D["Use L4/L40 instances<br/>AWS G5, GCP L4, Azure GA"]
    C -->|Fine-tuning| E["Use A100 instances<br/>Cost-effective training"]
    C -->|Large-scale training| F{"Scale?"}
    F -->|Single machine<br/>8-16 GPUs| G{"Interrupt<br/>tolerant?"}
    F -->|64+ GPUs| H{"Distributed<br/>networking critical?"}
    G -->|Yes| I["GCP preemptible A3<br/>or Azure Spot<br/>75% discount"]
    G -->|No| J["AWS P4d on-demand<br/>or GCP CUD<br/>Reliable + fast"]
    H -->|Yes| K["AWS P5<br/>NVSwitch<br/>600GB/s"]
    H -->|No| L["GCP A3 or<br/>Azure ND<br/>Still fast enough"]
```

Python Cost Calculator: Customize for Your Workload
Let's build a tool you can actually use. Here's a Python script to model your specific training scenario:
```python
#!/usr/bin/env python3
"""
Cloud GPU cost calculator for ML training workloads.
Customize with your parameters and run to get per-cloud cost estimates.
"""
from dataclasses import dataclass
from typing import Dict

@dataclass
class InstanceConfig:
    name: str
    provider: str
    gpu_count: int
    gpu_memory_gb: int
    hourly_on_demand: float
    hourly_spot: float
    inter_gpu_bandwidth_gbps: int

# Define instance configurations (Feb 2026 pricing)
INSTANCES = {
    'aws_p5.48xlarge': InstanceConfig(
        name='AWS p5.48xlarge',
        provider='AWS',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=39.80,
        hourly_spot=9.95,
        inter_gpu_bandwidth_gbps=600
    ),
    'aws_p4d.24xlarge': InstanceConfig(
        name='AWS p4d.24xlarge',
        provider='AWS',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=32.77,
        hourly_spot=8.19,
        inter_gpu_bandwidth_gbps=600
    ),
    'gcp_a3_highgpu': InstanceConfig(
        name='GCP a3-highgpu-8g',
        provider='GCP',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=88.49,
        hourly_spot=2.25 * 8,  # $2.25 per GPU-hour, 8 GPUs
        inter_gpu_bandwidth_gbps=3200
    ),
    'azure_ndh100v5': InstanceConfig(
        name='Azure NDH100v5 (8 GPU)',
        provider='Azure',
        gpu_count=8,
        gpu_memory_gb=640,
        hourly_on_demand=55.84,
        hourly_spot=6.81,  # ~$6.81/hr for the whole 8-GPU instance
        inter_gpu_bandwidth_gbps=200
    ),
}

def calculate_job_cost(
    instance_config: InstanceConfig,
    num_instances: int,
    hours: float,
    use_spot: bool = False,
    egress_gb: float = 2000,
    storage_gb: float = 1000,
    monthly_storage_cost: float = 23,
    interruption_rate: float = 0.03
) -> Dict[str, float]:
    """
    Calculate total cost for a distributed training job.

    Args:
        instance_config: Instance configuration
        num_instances: Number of instances to run
        hours: Training duration in hours
        use_spot: Whether to use spot/preemptible instances
        egress_gb: Data egress in GB
        storage_gb: Storage needed in GB
        monthly_storage_cost: Cost per 1000 GB per month
        interruption_rate: Hourly interruption rate for spot instances

    Returns:
        Dictionary with cost breakdown
    """
    hourly_rate = instance_config.hourly_spot if use_spot else instance_config.hourly_on_demand

    # Compute cost
    compute_cost = num_instances * hourly_rate * hours

    # Interruption penalty, assuming ~1 hour of recompute per interruption
    interruption_penalty = 0
    if use_spot and interruption_rate > 0:
        expected_interruptions = hours * interruption_rate
        interruption_penalty = expected_interruptions * num_instances * hourly_rate

    # Data egress cost (~$0.10/GB)
    egress_cost = egress_gb * 0.10

    # Storage cost
    duration_months = hours / 730  # Approximate hours per month
    storage_cost = (storage_gb / 1000) * monthly_storage_cost * duration_months

    # Total
    total_cost = compute_cost + interruption_penalty + egress_cost + storage_cost
    return {
        'compute_cost': compute_cost,
        'interruption_penalty': interruption_penalty,
        'egress_cost': egress_cost,
        'storage_cost': storage_cost,
        'total_cost': total_cost,
        'cost_per_gpu_hour': total_cost / (instance_config.gpu_count * num_instances * hours)
    }

# Example: 64 GPU training job for 2 weeks
print("=" * 80)
print("SCENARIO: 64-GPU Training Job (2 weeks)")
print("=" * 80)

# AWS P4d: 8 instances (64 GPUs)
aws_p4d_result = calculate_job_cost(
    INSTANCES['aws_p4d.24xlarge'],
    num_instances=8,
    hours=336,
    use_spot=True,
    interruption_rate=0.03
)
print("\nAWS p4d.24xlarge (Spot):")
print(f"  Compute: ${aws_p4d_result['compute_cost']:,.2f}")
print(f"  Interruption penalty: ${aws_p4d_result['interruption_penalty']:,.2f}")
print(f"  Egress: ${aws_p4d_result['egress_cost']:,.2f}")
print(f"  Total: ${aws_p4d_result['total_cost']:,.2f}")
print(f"  $/GPU-hour: ${aws_p4d_result['cost_per_gpu_hour']:.4f}")

# GCP A3: 8 instances (64 GPUs), preemptible
gcp_result = calculate_job_cost(
    INSTANCES['gcp_a3_highgpu'],
    num_instances=8,
    hours=336,
    use_spot=True,
    interruption_rate=0.04  # roughly 1 event per day
)
print("\nGCP a3-highgpu-8g (Preemptible):")
print(f"  Compute: ${gcp_result['compute_cost']:,.2f}")
print(f"  Interruption penalty: ${gcp_result['interruption_penalty']:,.2f}")
print(f"  Egress: ${gcp_result['egress_cost']:,.2f}")
print(f"  Total: ${gcp_result['total_cost']:,.2f}")
print(f"  $/GPU-hour: ${gcp_result['cost_per_gpu_hour']:.4f}")

# Azure ND: 8 instances (64 GPUs), spot
azure_result = calculate_job_cost(
    INSTANCES['azure_ndh100v5'],
    num_instances=8,
    hours=336,
    use_spot=True,
    interruption_rate=0.05  # ~5% per hour
)
print("\nAzure NDH100v5 (Spot):")
print(f"  Compute: ${azure_result['compute_cost']:,.2f}")
print(f"  Interruption penalty: ${azure_result['interruption_penalty']:,.2f}")
print(f"  Egress: ${azure_result['egress_cost']:,.2f}")
print(f"  Total: ${azure_result['total_cost']:,.2f}")
print(f"  $/GPU-hour: ${azure_result['cost_per_gpu_hour']:.4f}")

print("\n" + "=" * 80)
print("WINNER (for interrupt-tolerant workloads): Azure Spot")
print("RUNNER-UP (on cost): AWS P4d Spot")
print("SAFEST (no interruptions): AWS P4d On-Demand or GCP with CUD")
print("=" * 80)
```

Run this script and customize the parameters. Plug in your actual:
- Training duration
- Number of GPUs needed
- Data transfer volumes
- Interruption tolerance
Decision Framework: Which Cloud for Your Workload?
Choose AWS if:
- You need absolute reliability (no interruptions)
- You're training at massive scale (100+ GPUs) and need NVSwitch
- You already use AWS for infrastructure (ecosystem consistency)
- You prefer established, proven MLOps tooling
Choose GCP if:
- You can tolerate 24-hour preemption cycles (preemptible instances terminate within 24 hours, so plan on daily restarts)
- You want the cheapest total cost of ownership with managed ML
- You need committed use discounts for predictable, long-term workloads
- Your team loves Python and TensorFlow (native integration)
Choose Azure if:
- Cost is your only variable (spot pricing unbeatable)
- You're already deep in the Microsoft ecosystem
- You need on-demand reliability AND spot availability (Azure's stronger reserved instance ecosystem helps)
The Operational Reality: What Pricing Tables Don't Tell You
Let me be honest about something the pricing tables obscure: the total cost of ownership goes way beyond hourly instance rates. You're paying for networking, storage, data transfer, and most importantly, operational overhead.
When you use spot or preemptible instances at massive scale, you're introducing complexity. You need robust checkpointing. You need monitoring and alerting for preemption events. You need playbooks for automated restart. For a team of five, managing one spot instance is trivial. For a team managing fifty concurrent spot jobs across multiple regions, it becomes a full-time operational burden. We've seen teams save twenty percent on compute costs by switching to spot, then lose fifty percent of that savings to the engineer-hours spent managing interruptions and debugging failures.
Data transfer costs are often overlooked. If you're training a large model, you're moving terabytes of data in and out of cloud storage. AWS charges about $0.13 per GB for egress. That adds up fast. GCP is slightly cheaper at $0.12. Azure is $0.10, which matters if you're moving dozens of terabytes. For a 2-week training job moving 10TB out, you're looking at $1,200 to $1,300 in egress costs alone. This isn't reflected in the comparison tables, but it's very real.
Storage costs also compound. You're not just paying for the model checkpoints you save - you're paying for temporary storage during training, intermediate artifacts, tensorboard logs, and everything else that gets written to disk. A petabyte-month of storage at AWS S3 standard is about $23,000. For a large training job generating lots of diagnostic data, this matters.
Then there's the human factor. Running GPU infrastructure requires someone on call. Someone needs to monitor utilization, catch runaway jobs, debug failures. That's not free. Managed services like SageMaker and Vertex AI offload this burden to the cloud provider, which is why they're worth the overhead cost even though they're technically more expensive per compute unit.
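Those side costs are easy to fold into an estimate. A small sketch using the per-GB egress figures above (region-dependent, treat them as assumptions) and ~$23 per TB-month for standard object storage:

```python
# Non-compute costs for a training job: egress + object storage.
# Per-GB egress rates are the approximate figures quoted above.
EGRESS_PER_GB = {"AWS": 0.13, "GCP": 0.12, "Azure": 0.10}

def side_costs(provider, egress_tb, storage_tb, weeks,
               storage_per_tb_month=23.0):
    egress = egress_tb * 1000 * EGRESS_PER_GB[provider]        # TB -> GB
    storage = storage_tb * storage_per_tb_month * (weeks * 7 / 30)
    return egress + storage

# 2-week job moving 10 TB out, keeping 5 TB of checkpoints and logs
for provider in EGRESS_PER_GB:
    print(f"{provider}: ~${side_costs(provider, 10, 5, weeks=2):,.0f}")
```

At this scale egress dominates storage by an order of magnitude, which is why keeping data in-region (and in-cloud) is usually the first TCO optimization worth making.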
Multicloud Strategies: Hedging Your Bets
Smart organizations don't bet everything on one cloud provider. They deploy across AWS, GCP, and Azure strategically. Here's how we see mature teams approach this.
For training workloads, they use GCP preemptibles as the primary compute because of cost, with AWS P4d instances as a fallback for jobs that can't tolerate interruptions. For inference, they use AWS on-demand for latency-sensitive applications, GCP committed discounts for baseline capacity, and Azure spot for bursty workloads. This diversification protects against regional outages, handles different workload characteristics with appropriately priced hardware, and creates negotiating leverage with cloud providers.
The downside is operational complexity. Your CI/CD needs to be cloud-agnostic. Your container images need to work across AWS, GCP, and Azure. Your monitoring and logging need to aggregate across providers. Your data replication needs to account for geo-distribution. But the benefits - cost optimization, reliability, and strategic flexibility - often justify the engineering investment.
Wrapping Up: The Real Cost of ML at Scale
You've now got the full picture. The cloud GPU market is competitive, pricing evolves monthly, and the "best" choice depends entirely on your tolerance for interruptions, your duration commitment, and your operational maturity.
Let's synthesize the key insights. AWS excels at scale with NVSwitch providing unmatched inter-GPU bandwidth, making it the clear choice for ultra-large training clusters where every percentage point of efficiency matters. The P5 infrastructure represents the frontier of distributed GPU performance, though that leadership comes at a cost that newer teams might not be able to justify. GCP offers a fundamentally different value proposition, particularly through preemptible instances and committed use discounts that can drive total cost of ownership below AWS for the right workload profile. If your training job can tolerate daily interruptions, GCP's economics are almost unbeatable. Azure rounds out the comparison with the most aggressive spot pricing, making it ideal for cost-optimized workloads run by teams with mature interruption-handling infrastructure.
Key takeaways:
- AWS P5 with NVSwitch is king for ultra-large-scale distributed training requiring high inter-GPU bandwidth.
- GCP preemptible A3 offers the best balance of cost and predictability for tolerant workloads.
- Azure spot is unbeatable on raw price if you can handle frequent interruptions.
- Managed ML services (SageMaker, Vertex AI, Azure ML) add roughly 2-15% overhead but handle orchestration, checkpointing, and auto-scaling for you.
- Checkpoint-restart patterns are non-negotiable if you use spot/preemptible instances.
- Data transfer and storage costs are often larger than you expect - factor them into your TCO analysis.
- Multicloud strategies hedge your bets and optimize for specific workload types.
The winning strategy depends on your situation. Teams just starting with GPU training should probably begin with GCP preemptibles or Azure spot to minimize spend while they learn the operational patterns. As you scale and develop operational maturity, AWS on-demand or committed discounts become more cost-effective. Long-term, mature teams often use all three clouds, routing workloads intelligently to minimize total cost.
Use the Python calculator to model your specific workload, but don't stop there. Run small pilot jobs on each cloud. Measure actual costs, not just hourly rates. Account for data transfer, storage, and operational overhead. Benchmark end-to-end training time, not just compute time. And remember: the cheapest GPU isn't the one that costs the least per hour - it's the one that trains your model fastest while staying within budget, with operational overhead you can actually manage.
The cloud GPU market will continue evolving. Prices shift. New instance types launch. Your job is to understand the architecture deeply enough that you can adapt to changes, not just chase whoever has the lowest advertised rate this month.
Negotiating with Cloud Providers: Getting Better Rates
Most teams accept published pricing without negotiation, leaving hundreds of thousands on the table. Cloud providers have substantial negotiating room, especially for committed workloads. If you're planning to spend one million dollars on GPU training over the next year, AWS will likely give you 20-30 percent off their published rates in exchange for a commitment. GCP's committed use discounts are published, but they can negotiate beyond that if your volume is large enough. Azure has less flexibility, but spot instance pricing for large volumes can be aggressively discounted.
The negotiation starts with understanding your committed spend. If you can truthfully say "we're planning to train models 24/7 for the next year on 64 H100s," that's leverage. Cloud providers have capacity utilization targets - they'd rather you commit to that capacity at a discount than leave it empty. Get proposals from all three clouds, share each quote with the competing providers, and play them off each other. This is not unethical; it's business. The standard process is: declare your intent to commit to a volume, get initial quotes, negotiate, get revised quotes, decide.
For organizations larger than mid-market, this can represent millions of dollars in savings. A large AI research lab training dozens of models might save three to four million dollars per year by negotiating effectively. This is a critical conversation to have with your procurement team.
The Strategic Multicloud Approach
Sophisticated organizations use multiple clouds strategically. The typical pattern: primary training on the cloud with the best raw price-performance (usually GCP for preemptibles, AWS for on-demand), fallback capacity on a second cloud for overflow, inference on the cloud with the lowest serving cost (often different from training). This requires some operational complexity - your training code needs to work across clouds, your data needs replication strategies, your monitoring needs to aggregate across clouds. But the flexibility is worth it. You're never held hostage by a single cloud provider's capacity or price. If AWS has a regional outage, you automatically failover to GCP. If GCP raises prices, you shift workloads to Azure. This resilience and negotiating leverage are valuable.
For truly large organizations, the math often works out to a three-cloud strategy where each cloud hosts a different part of your workload. Training on GCP preemptibles (cheapest), inference on AWS (best networking for end-users), and analytics on Azure (deep integration with enterprise tools). This requires more infrastructure investment upfront, but the cost savings and operational resilience compound over time.