January 16, 2026
Python PyTorch Deep Learning GPU Computing

Training at Scale: GPU Computing, Mixed Precision, and Distributed Training

You've built a model. It works. But training takes days, or weeks. Your GPU sits at 30% utilization. Memory fills up after three batches. Sound familiar?

Welcome to the real world of deep learning. This is where theory hits the wall and optimization becomes everything.

Table of Contents
  1. Why Scaling Training Is the Real Bottleneck in Modern AI
  2. Why GPU Training Matters (And Why It's Not Automatic)
  3. GPU Computing Fundamentals
  4. GPU Memory: The Hard Ceiling
  5. Profiling Your Memory Usage
  6. Batch Size: The Tuning Knob
  7. Gradient Checkpointing: Memory for Compute
  8. Mixed Precision: Speed Without Loss
  9. torch.cuda.amp: Your New Best Friend
  10. Real Numbers: Mixed Precision Performance
  11. Gradient Accumulation: Bigger Batches Without More Memory
  12. Distributed Training Strategies
  13. DataParallel (Quick and Dirty)
  14. DistributedDataParallel (The Real Deal)
  15. Torch.compile: PyTorch 2.0's Hidden Weapon
  16. Hugging Face Accelerate: Distributed Training for Mortals
  17. Common Scaling Mistakes
  18. Throughput Benchmarking: Measuring What Matters
  19. Cloud GPUs: Where to Train at Scale
  20. Putting It Together: A Real Example
  21. The Path Forward
  22. Where This Takes You

Why Scaling Training Is the Real Bottleneck in Modern AI

Here's the uncomfortable truth that most tutorials skip: your model architecture is rarely the bottleneck. The real constraint is almost always training speed. You can have the most elegant neural network ever designed, but if it takes three weeks to train on your hardware, you're not going to iterate, experiment, or catch mistakes fast enough to make it useful. The difference between a research project that produces results and one that stalls out is often nothing more than efficient GPU utilization.

Think about what fast training actually enables. It means you can run five experiments in the time it used to take to run one. It means you catch a subtle bug in epoch two instead of discovering it at epoch forty. It means your learning rate schedule actually matters because you can afford to test it. This is why the world's top AI labs invest so much in training infrastructure, not because they're showing off, but because iteration speed is compound interest. Every week you shave off training time is a week you can spend on a better idea.

The techniques we're covering in this article are not exotic. They're the standard toolkit used by ML engineers at companies building production AI systems right now. Mixed precision training, gradient checkpointing, distributed data parallelism, these aren't research curiosities anymore. They're table stakes. The good news is that PyTorch has made all of them remarkably accessible in the past few years, and you don't need a cluster of A100s to benefit. Even on a single consumer GPU, these techniques routinely cut training time in half or better. That's the promise. Now let's deliver on it.

In this article, we're going to demolish your training bottlenecks. You'll learn how to squeeze 2x-4x speedups out of your existing hardware without changing a single line of model code. We're talking GPU memory management, mixed precision training, distributed computing, and the modern tools that make it all click together.

Let's go.

Why GPU Training Matters (And Why It's Not Automatic)

Here's the thing: buying a GPU doesn't automatically make training fast. An A100 can sit idle while your poorly optimized training loop starves it of data. A V100 can run out of memory on a batch size that should fit.

The gap between peak performance and actual throughput is where real engineering lives. That gap is your opportunity.

When we talk about training at scale, we're really asking three questions:

  1. How much data can we push through per second? (Throughput)
  2. How much GPU memory do we actually need? (Memory efficiency)
  3. Can we distribute work across multiple GPUs or machines? (Scaling)

Get these right, and training time drops. Get them wrong, and you're watching a progress bar inch forward while your cloud bill climbs.

GPU Computing Fundamentals

Before you can optimize GPU training, you need to understand why GPUs are fast in the first place, and why they're fast at some things and not others. A modern CPU has between 8 and 64 cores, each highly capable of complex sequential logic. An A100 GPU has 6912 CUDA cores. That's not a typo. The GPU wins on parallelism by a factor of hundreds, which is exactly what matrix multiplication, the core operation of neural networks, demands.

The key insight is memory bandwidth. A CPU communicates with RAM at roughly 50-100 GB/s. An A100 communicates with its on-chip HBM2e memory at 2 TB/s. When your training loop stalls, it's almost always because you're not feeding the GPU fast enough, not because the GPU itself is slow. Your CPU-based data pipeline, your disk reads, your Python-level preprocessing, these are the bottlenecks. The GPU is sitting there, hungry, waiting for work.

This distinction matters practically. When you see GPU utilization hovering at 30-40% in nvidia-smi, the GPU isn't the problem. The data loader is. This is why num_workers > 0 in your DataLoader matters enormously, you're telling Python to prefetch batches in parallel while the GPU processes the current one. It's also why pin_memory=True helps: pinned memory transfers to the GPU faster because it bypasses a copy step. Understanding that your GPU is a throughput machine that needs to be continuously fed, not a smart machine that can wait patiently, changes how you think about every optimization decision downstream.
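Putting those flags together, a loader setup along these lines (the toy TensorDataset stands in for your real dataset; the sizes are placeholders) keeps the GPU continuously fed:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real one
dataset = TensorDataset(
    torch.randn(1024, 3, 32, 32),
    torch.randint(0, 10, (1024,)),
)

train_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # prefetch batches in parallel worker processes
    pin_memory=True,          # page-locked host memory -> faster H2D copies
    persistent_workers=True,  # keep workers alive between epochs
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch, labels in train_loader:
    # non_blocking lets the copy overlap with compute (needs pinned memory)
    batch = batch.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break  # one batch is enough for illustration
```

The `non_blocking=True` flag only does anything when the source tensor sits in pinned memory, which is exactly what `pin_memory=True` arranges.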

GPU Memory: The Hard Ceiling

Your GPU has finite memory. An A100 has 40GB (80GB in the newer variant). A 3090 has 24GB. A T4 has 16GB. That's it.

Everything has to fit in there:

  • Your model weights
  • Activations (needed for backprop)
  • Gradients
  • Optimizer state
  • Batch data

Run out of memory, and you get the dreaded CUDA out of memory error. Training stops. You reduce batch size. Training slows down.
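Before reaching for tricks, it helps to do the arithmetic. Here's a back-of-envelope sketch for the static portion of that list, assuming standard fp32 training with Adam (activations are excluded on purpose; they depend on batch size and architecture, and are often the largest term):

```python
def estimate_training_memory_gb(num_params: int, bytes_per_param: int = 4) -> dict:
    """Rough memory budget for fp32 training with Adam.

    Weights + gradients + Adam's two moment buffers = 4 copies of the
    parameters, at 4 bytes each in float32.
    """
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    adam_moments = 2 * num_params * bytes_per_param  # exp_avg + exp_avg_sq
    total = weights + grads + adam_moments
    return {
        "weights_gb": weights / 1e9,
        "grads_gb": grads / 1e9,
        "optimizer_gb": adam_moments / 1e9,
        "total_static_gb": total / 1e9,
    }

# A 1B-parameter model needs ~16 GB before storing a single activation
print(estimate_training_memory_gb(1_000_000_000))
```

This is why a model whose weights are "only" 4 GB can still OOM a 16 GB card the moment you call `optimizer.step()`.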

Profiling Your Memory Usage

Before optimizing, you need to see what's happening. Guessing at memory usage is a losing game, you need hard numbers before you can make smart tradeoffs. The two most important distinctions are between allocated memory (tensors your code is actually holding) and reserved memory (what CUDA grabbed from the OS as a buffer). Watching these numbers during a training run tells you whether your model is the problem or your data pipeline is.

python
import torch
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
# Check memory state
print(f"Allocated: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved(device) / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated(device) / 1e9:.2f} GB")
 
# Reset for clean profiling
torch.cuda.reset_peak_memory_stats()

Why this matters: "Allocated" is what your tensors actually use. "Reserved" is what CUDA grabbed from the OS (it's often more, for efficiency). "Max allocated" tells you the high-water mark during training. The max allocated number is the one to optimize, it's what determines whether you hit OOM.

PyTorch's memory profiler goes deeper, giving you per-operation breakdowns that let you pinpoint exactly which layer or operation is eating your memory budget. This is particularly useful for transformer models where attention mechanisms can spike memory dramatically for long sequences, or for models with many skip connections that hold activations in memory simultaneously.

python
from torch.profiler import profile, record_function, ProfilerActivity
 
with profile(activities=[ProfilerActivity.CUDA],
             record_shapes=True,
             profile_memory=True) as prof:
    output = model(input_data)
    loss = criterion(output, labels)
    loss.backward()
 
print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))

This shows you which operations are memory hogs. Usually it's the activations during the forward pass, or the gradients during backprop. Once you know the culprit, you can target your optimization precisely instead of trying random tricks and hoping one sticks.

Batch Size: The Tuning Knob

Larger batches are more efficient. Smaller batches fit in memory. This is the fundamental tradeoff.

Here's the practical dance:

python
# Start small, work up; a few batches per size is enough to hit peak memory
batch_sizes = [8, 16, 32, 64, 128, 256]
 
for bs in batch_sizes:
    train_loader = DataLoader(dataset, batch_size=bs)
    try:
        for i, (batch, labels) in enumerate(train_loader):
            batch = batch.to(device)
            labels = labels.to(device)
            output = model(batch)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if i >= 3:  # peak memory shows up within the first few steps
                break
        print(f"Batch size {bs}: SUCCESS")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"Batch size {bs}: OUT OF MEMORY")
            torch.cuda.empty_cache()  # release the failed allocation
            break
        raise

Why probe instead of calculating? Memory use isn't linear in batch size. Going from batch 32 to 64 might add 1.5x memory (not 2x) because some overhead is constant. Your job is finding the sweet spot where GPU memory is 80-90% utilized. At that utilization level, you're getting maximum computational throughput per dollar spent on hardware.

Gradient Checkpointing: Memory for Compute

Here's a clever trick: during backprop, you need activations from the forward pass. They're stored in memory. Gradient checkpointing deletes them, then recomputes them during backprop.

It sounds wasteful (recomputing!), but it cuts memory use by 30-40% while adding 20-30% compute time. Trade-off? Usually worth it, especially if the alternative is reducing your batch size to something that kills GPU utilization entirely. The math is simple: if you're memory-constrained and can't fit a reasonable batch size, spending 25% more compute time to free up 35% of memory is a great deal.

python
from torch.utils.checkpoint import checkpoint
 
class SmallModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(1024, 1024)
        self.layer2 = torch.nn.Linear(1024, 1024)
        self.layer3 = torch.nn.Linear(1024, 10)
 
    def forward(self, x):
        # Normal: stores all activations
        # x = self.layer1(x)
        # x = torch.relu(x)
        # x = self.layer2(x)
        # x = torch.relu(x)
 
        # With checkpointing: recomputes layer1 during backprop
        x = checkpoint(self._layer1_block, x, use_reentrant=False)
        x = checkpoint(self._layer2_block, x, use_reentrant=False)
        return self.layer3(x)
 
    def _layer1_block(self, x):
        x = self.layer1(x)
        return torch.relu(x)
 
    def _layer2_block(self, x):
        x = self.layer2(x)
        return torch.relu(x)

The use_reentrant=False flag is important, it's the modern API that's more stable. The older reentrant mode has edge cases that can produce incorrect gradients in certain model architectures, so always use the newer interface unless you have a specific reason not to.

Mixed Precision: Speed Without Loss

Here's where things get interesting. Floating-point numbers don't all need the same precision, and exploiting this fact is one of the highest-leverage optimizations available to you. The intuition is straightforward: most of the numerical work in a neural network forward pass doesn't require the full 23 bits of mantissa that float32 provides. The actual information content, the signal that drives learning, survives in 10 bits just fine for the vast majority of operations.

Modern Nvidia GPUs starting from the Volta architecture (V100, 2017) include dedicated Tensor Core hardware that processes float16 matrix multiplications at roughly twice the throughput of float32 equivalents. The A100 adds bfloat16 support, which keeps float32's exponent range at reduced mantissa precision, making it more numerically stable than float16 for training. The key question is always where precision actually matters: model weights and intermediate activations can tolerate float16, but loss computation and optimizer states need float32 to avoid the subtle numerical drift that causes training instability over thousands of steps.

Float32 (full precision) is safe. You can train anything in float32, but it's slow and memory-hungry.

Float16 (half precision) uses half the memory and half the bandwidth per value. Some operations are 2-4x faster. But it's numerically unstable, training can explode or collapse.

Mixed precision is the sweet spot: use float16 for forward/backward passes, keep float32 for loss computation and optimizer state. PyTorch handles the casting automatically, so you don't need to manually annotate every tensor. You opt in at the training loop level, and the autocast context manager figures out which operations benefit from reduced precision and which need to stay at full precision.
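As a concrete sketch, here's the bfloat16 flavor via the generic torch.autocast entry point (the device_type argument lets the same code run on CPU for illustration; on GPUs, bfloat16 autocast wants Ampere or newer). Note the absence of a GradScaler: bfloat16 keeps float32's exponent range, so underflow scaling isn't needed:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

# bfloat16 has float32's exponent range, so gradient underflow is rare
# and no GradScaler is required
with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
    output = model(x)
    loss = torch.nn.functional.mse_loss(output, target)

loss.backward()   # gradients land in float32 on the master weights
optimizer.step()
optimizer.zero_grad()
```

The float16 path, covered next, needs one extra moving part: loss scaling.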

torch.cuda.amp: Your New Best Friend

PyTorch's torch.cuda.amp module handles this automatically. (In recent PyTorch releases the same functionality lives under torch.amp; torch.cuda.amp is deprecated in favor of it but still works.) The autocast context manager dynamically casts operations to float16 where it's safe to do so, things like linear layers, convolutions, and most activations. Operations that are sensitive to precision, like loss accumulation and batch normalization statistics, stay in float32. You don't have to think about which is which.

python
from torch.cuda.amp import autocast, GradScaler
 
device = torch.device("cuda")
model = MyModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()
 
for epoch in range(num_epochs):
    for batch, labels in train_loader:
        batch, labels = batch.to(device), labels.to(device)
 
        # Autocast: forward pass in float16 where safe
        with autocast():
            output = model(batch)
            loss = criterion(output, labels)
 
        # Backward with loss scaling (prevents gradient underflow)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

Why the GradScaler? In float16, gradients can become tiny (underflow to zero). Scaling inflates them before backprop, then deflates them before the optimizer step. Automatic, transparent. The scaler also monitors for overflow, if it detects that a gradient has gone to infinity (the opposite problem), it skips the optimizer step for that batch and reduces the scale factor. This self-correcting behavior means you can enable mixed precision and trust it to handle the edge cases without babysitting.

Real Numbers: Mixed Precision Performance

Here's what you get in practice:

  • Memory usage: 40-50% reduction in activations (master weights and optimizer state stay in float32)
  • Training speed: 1.5-2.5x faster (depending on GPU and model)
  • Convergence: Identical (or imperceptibly close) with proper tuning

Let's prove it by running a direct comparison benchmark on the same model and data, toggling only the mixed precision flag. The measured time per batch is the ground truth, ignore theoretical FLOP counts and focus on wall-clock throughput, because that's what determines when your experiment finishes.

python
import time
 
def train_step(model, batch, labels, optimizer, scaler, use_mixed_precision=False):
    torch.cuda.synchronize()  # GPU ops are async; flush the queue before timing
    start = time.time()
 
    if use_mixed_precision:
        with autocast():
            output = model(batch)
            loss = criterion(output, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        output = model(batch)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
 
    optimizer.zero_grad()
    torch.cuda.synchronize()  # wait for the step to actually finish
    return time.time() - start, loss.item()
 
# Benchmark
for mixed in [False, True]:
    times = []
    for batch, labels in train_loader:
        batch, labels = batch.to(device), labels.to(device)
        t, _ = train_step(model, batch, labels, optimizer, scaler, mixed)
        times.append(t)
 
    avg_time = sum(times) / len(times)
    print(f"Mixed precision={mixed}: {avg_time*1000:.2f}ms per batch")

On an A100 with a ResNet-50: you'll see ~2x speedup. On older GPUs, closer to 1.5x. The gains scale with model size, larger models with more matrix multiplications see larger benefits, which is why mixed precision has become essentially mandatory for training large language models where every percentage point of efficiency matters at the scale of weeks of compute.

Gradient Accumulation: Bigger Batches Without More Memory

Sometimes you want larger effective batch sizes, but your GPU can't hold them. Solution: gradient accumulation.

Instead of updating weights after each batch, accumulate gradients over several batches, then update once. The mathematical effect is identical to training with a batch size equal to your accumulation factor times your actual batch size. This matters because larger batches often produce better gradient estimates, allowing you to use higher learning rates and converge faster. Research from the ImageNet training community showed that scaling batch size and learning rate together enables dramatically faster training without sacrificing accuracy, gradient accumulation is how you access those benefits when you're memory-constrained.

python
accumulation_steps = 4  # Accumulate 4 batches before update
optimizer.zero_grad()
 
for i, (batch, labels) in enumerate(train_loader):
    batch, labels = batch.to(device), labels.to(device)
 
    with autocast():
        output = model(batch)
        loss = criterion(output, labels)
 
    # Scale loss by accumulation steps to keep gradients sane
    (loss / accumulation_steps).backward()
 
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Why scale by accumulation_steps? Because gradients add up across backward calls. If you don't scale, the gradient magnitude, and with it your effective learning rate, is multiplied by the accumulation factor (4x here). The scaling ensures that the gradient going into the optimizer is the same as it would be if you'd processed the full accumulated batch at once, preserving the semantics of your learning rate without any manual adjustment.

Effect: You get the gradient statistics of a batch 4x larger, without needing 4x memory.

Distributed Training Strategies

When a single GPU isn't enough, distributed training is how you break through the ceiling. But "distributed training" is not a single thing, it's a spectrum of strategies with very different tradeoffs, and choosing the wrong one can cost you more in communication overhead than you gain from the extra hardware. The two axes that matter are how you split the work and how you synchronize the results.

Data parallelism is the most common approach: each GPU gets a copy of the full model and processes a different subset of the batch. At the end of each forward-backward pass, the GPUs communicate to average their gradients. This works well when your model fits in GPU memory and you just want to process more data per step. Model parallelism is the alternative: split the model itself across GPUs, with different layers living on different devices. This is what you need when the model itself is too large for any single GPU, it's how GPT-3-scale models are trained. For most practitioners building serious but not frontier-scale models, data parallelism is the right tool, and PyTorch's DistributedDataParallel makes it relatively painless to set up correctly.
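To make the distinction concrete, here's a minimal model-parallel sketch: half the layers on each device, with the activation hop happening inside forward(). This is a toy; real pipeline parallelism adds micro-batch scheduling to keep both devices busy, which this version omits. With fewer than two GPUs it collapses onto one device so the code still runs:

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Minimal model parallelism: front half on dev0, back half on dev1."""
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.front = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(dev0)
        self.back = nn.Linear(512, 10).to(dev1)

    def forward(self, x):
        x = self.front(x.to(self.dev0))
        return self.back(x.to(self.dev1))  # activations hop to the second device

if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    # Fallback so the sketch runs anywhere
    dev0 = dev1 = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = TwoDeviceModel(dev0, dev1)
out = model(torch.randn(8, 512))
print(out.shape)
```

Note the cost hiding in that `x.to(self.dev1)`: every forward pass pays a device-to-device transfer, which is why model parallelism is a last resort rather than a free lunch.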

DataParallel: Simple, synchronous, slower (useful for prototyping).

DistributedDataParallel (DDP): Slightly more setup, but 3-4x faster on multi-GPU.

DataParallel (Quick and Dirty)

DataParallel is the simplest path to multi-GPU training, it wraps your model with a single function call. The trade-off is that it uses one GPU as a coordinator, shuttling data to the others and collecting results, which creates a communication bottleneck. For two GPUs this is usually fine; beyond that, the coordinator GPU becomes a throughput constraint that negates much of the scaling benefit.

python
model = MyModel()
model = torch.nn.DataParallel(model)
model.to(device)
 
# Everything else is the same
for batch, labels in train_loader:
    batch, labels = batch.to(device), labels.to(device)
    output = model(batch)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

DataParallel automatically splits batches across GPUs, gathers results, and synchronizes gradients. Easy. But it has communication overhead that kills scaling beyond 2-3 GPUs. If you have more hardware than that, or if you're serious about training efficiency, DDP is where you need to be.

DistributedDataParallel (The Real Deal)

DDP uses a fundamentally different communication model. Instead of a coordinator GPU, every GPU is a peer. Gradients are synchronized using ring-allreduce, each GPU sends and receives from its neighbors in a ring topology, so communication is distributed evenly. This scales efficiently to dozens of GPUs and is the standard approach for any serious multi-GPU training. The setup cost is real but one-time.

python
import os
import torch.distributed as dist
 
# Initialize distributed training
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
 
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
 
model = MyModel().to(device)
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank
)
 
# Sampler ensures each process gets unique batches
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=world_size,
    rank=dist.get_rank(),
    shuffle=True
)
train_loader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)
 
# Training loop (same as single GPU, plus set_epoch for proper shuffling)
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for batch, labels in train_loader:
        batch, labels = batch.to(device), labels.to(device)
        output = model(batch)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Then launch with torchrun:

bash
torchrun --nproc_per_node=4 train.py

This spawns 4 processes (one per GPU), each handling its own batch, communicating via NCCL (Nvidia's fast collective communication library). Efficiency scales linearly up to 8 GPUs, then tails off (network overhead). The DistributedSampler is a critical detail that's easy to forget, without it, every GPU sees the same data, which means you're not actually getting the batch size scaling benefit you're paying for.
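Two housekeeping details most torchrun scripts need, sketched here as small helpers (the helper names are my own, not a PyTorch API): only rank 0 should write checkpoints, and the process group should be torn down at exit:

```python
import torch
import torch.distributed as dist

def is_main_process() -> bool:
    """Rank 0 only -- also safe to call before init_process_group."""
    return not dist.is_initialized() or dist.get_rank() == 0

def save_checkpoint(model, path="checkpoint.pt"):
    # Only rank 0 writes; otherwise N processes race on the same file
    if is_main_process():
        # DDP wraps the model -- unwrap so checkpoint keys stay clean
        to_save = model.module if hasattr(model, "module") else model
        torch.save(to_save.state_dict(), path)

def cleanup():
    # Call at the end of training to tear down NCCL communicators
    if dist.is_initialized():
        dist.destroy_process_group()
```

Because all ranks hold identical weights after every gradient sync, saving from rank 0 alone loses nothing.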

Torch.compile: PyTorch 2.0's Hidden Weapon

PyTorch 2.0 introduced torch.compile(), which JIT-compiles your model into an optimized kernel. One line. Serious speed.

python
model = MyModel()
model = torch.compile(model)
 
# Now training runs through optimized kernels
for batch, labels in train_loader:
    batch, labels = batch.to(device), labels.to(device)
    output = model(batch)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

Results vary by model, but expect 20-40% speedup on modern GPUs (A100, H100). Gains are smaller on older hardware. The compilation step happens on the first forward pass, so your first batch will be noticeably slow, don't be alarmed. Subsequent batches benefit from the optimized kernels that were generated. This compilation cost is a one-time investment that pays off across all future batches in the training run.

Trade-off: First run is slow (compilation), then everything flies. Also, not all operations are supported (though that list is growing). If torch.compile hits an unsupported operation, it falls back gracefully to eager mode for that part of the graph rather than crashing, so the worst case is that you don't get the speedup, not that your training breaks.

Hugging Face Accelerate: Distributed Training for Mortals

If DDP looks like too much boilerplate, Hugging Face's accelerate library handles it for you. The premise is that you shouldn't need to restructure your training code to go from one GPU to eight, or from one machine to a cluster. accelerate achieves this by wrapping your model, optimizer, and data loader in a thin abstraction that handles the distribution details based on the environment it detects at runtime.

python
from accelerate import Accelerator
 
accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)
 
for batch, labels in train_loader:
    with accelerator.accumulate(model):
        output = model(batch)
        loss = criterion(output, labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

One function call replaces 20 lines of setup. Handles DDP, DeepSpeed, mixed precision, gradient accumulation, all at once. It's the Goldilocks solution for most practitioners. You configure everything through a config file generated by accelerate config, which walks you through your hardware setup interactively, and then the same training script runs unmodified on a laptop, a single GPU, eight GPUs, or a distributed cluster.
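In practice the workflow is two commands (train.py is a placeholder for your own script):

```shell
# One-time interactive setup: answers are saved to a config file
accelerate config

# Launch the same script on whatever the config describes --
# single GPU, multi-GPU, or multi-node
accelerate launch train.py
```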

Common Scaling Mistakes

The techniques above are powerful, but they come with failure modes that are easy to stumble into. Understanding the most common mistakes saves you from spending days debugging mysterious convergence failures or performance regressions.

The first and most pervasive mistake is forgetting to call optimizer.zero_grad() in the right place when using gradient accumulation. If you zero gradients before the accumulation window is complete, you silently discard the accumulated signal. If you never zero them, gradients accumulate across multiple accumulation windows and your effective batch size grows unboundedly. Always confirm your zeroing happens exactly once per optimizer step.

The second mistake is scaling learning rate incorrectly when changing effective batch size. The linear scaling rule says: if you multiply your batch size by N, multiply your learning rate by N. Violating this produces training that appears to work (loss goes down) but converges to worse solutions, the kind of subtle degradation that's hard to notice without careful comparison baselines. When you enable gradient accumulation or go distributed, your effective batch size changes, and your learning rate should change with it.
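The rule itself is one line of arithmetic. A sketch, with the caveat that very large batches usually also want a warmup schedule rather than the scaled rate from step one:

```python
def scaled_lr(base_lr: float, base_batch: int, effective_batch: int) -> float:
    """Linear scaling rule: learning rate grows with effective batch size.

    effective_batch = per_gpu_batch * accumulation_steps * num_gpus
    """
    return base_lr * effective_batch / base_batch

# Tuned at batch 32; now running 64 per GPU x 4 GPUs x 2 accumulation steps
per_gpu, num_gpus, accum = 64, 4, 2
effective = per_gpu * num_gpus * accum  # 512
print(scaled_lr(1e-3, 32, effective))   # 0.016
```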

The third mistake is ignoring the num_workers parameter in DataLoader. Leaving it at zero means data loading happens synchronously on the main thread, and your GPU sits idle while Python decodes images. A good starting point is num_workers=4; tune it upward until GPU utilization stops improving. Pair this with pin_memory=True on systems with CUDA to speed up host-to-device memory transfers.

The fourth mistake is neglecting to call sampler.set_epoch(epoch) in distributed training. The DistributedSampler uses the epoch number to seed its shuffling. Without set_epoch, every epoch uses the same ordering, which means your training data is effectively not shuffled across epochs, a subtle data poisoning that hurts generalization without triggering any obvious errors or warnings.
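A minimal demonstration of the fix (num_replicas and rank are passed explicitly here so the sketch runs without a process group; in a real torchrun job you'd omit them):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16))

sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, sampler=sampler, batch_size=4)

def first_batch(epoch):
    sampler.set_epoch(epoch)  # reseeds the shuffle with the epoch number
    return next(iter(loader))[0].tolist()

# Different epochs now shuffle differently; without set_epoch,
# every epoch would replay epoch 0's ordering
print(first_batch(0), first_batch(1))
```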

Throughput Benchmarking: Measuring What Matters

All this optimization means nothing if you're not measuring the right thing.

Throughput = samples per second. This is your north star metric. Not loss curves (which measure learning, not speed), not GPU memory usage (which measures headroom, not throughput), and not epoch time in isolation (which conflates dataset size with training speed). Samples per second tells you exactly how much work your training pipeline does per unit time, making it the right denominator for comparing any two configurations.

python
import time
 
def benchmark(model, train_loader, device, num_batches=100):
    torch.cuda.synchronize()
    start = time.time()
    total_samples = 0
 
    for i, (batch, labels) in enumerate(train_loader):
        batch, labels = batch.to(device), labels.to(device)
 
        with autocast():
            output = model(batch)
            loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
 
        total_samples += batch.size(0)  # count actual samples, not assumed batch size
        if i + 1 >= num_batches:
            break
 
    torch.cuda.synchronize()
    elapsed = time.time() - start
 
    print(f"Throughput: {total_samples / elapsed:.0f} samples/sec")
    print(f"Time per batch: {elapsed / (i + 1) * 1000:.2f}ms")

The torch.cuda.synchronize() calls are critical, GPU operations are asynchronous, so you need to force them to finish before timing. Without synchronization, your timing measurements reflect when operations were submitted to the GPU rather than when they completed, which can make a slow training loop look deceptively fast.

Key metrics to track:

  • Samples/sec (higher is better)
  • GPU utilization (nvidia-smi shows this)
  • Memory usage (should be 80-90% of total)
  • Convergence speed (wall-clock time to desired loss, not epoch count)
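For the utilization and memory numbers, nvidia-smi's query mode gives you a machine-readable stream you can correlate with your training logs, for example:

```shell
# Refresh every second; redirect to a file for later analysis
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
           --format=csv -l 1
```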

Cloud GPUs: Where to Train at Scale

You don't always have hardware. Cloud GPU options:

Google Colab (~$10/month for Pro):

  • Free tier: K80 (old, slow)
  • Pro: A100 or T4 (decent)
  • Good for prototyping, not sustained training

Lambda Labs ($0.50-$1.20/hr):

  • Nvidia A6000, A5000, V100
  • Bare metal (no shared, no oversubscription)
  • Best price-to-performance for serious work

RunPod (~$0.44/hr for A40):

  • On-demand spot instances
  • Community cloud (cheap) or secure cloud (more expensive)
  • Good for distributed training (easy multi-GPU setup)

AWS SageMaker (variable, usually expensive):

  • Enterprise features (monitoring, integration)
  • Overkill for most research, but production-ready

Paperspace Gradient (~$0.51/hr for A4000):

  • Easy notebook interface
  • Persistent storage
  • Decent for medium-term projects

The play: Start on Colab (free), prototype locally with a smaller model, then scale on Lambda or RunPod once you know it works.

Putting It Together: A Real Example

Let's train a ResNet-50 on CIFAR-10 with all the tricks. Notice how each technique layers on top of the others without requiring changes to model architecture or the training objective, this is the composability of PyTorch's training primitives working in your favor.

python
import torch
import torch.nn as nn
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler
 
# Setup
device = torch.device("cuda")
batch_size = 256
num_epochs = 5
 
# Data with augmentation
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)
 
# Model: ResNet-50, compiled, trained with mixed precision below
model = models.resnet50(num_classes=10).to(device)
model = torch.compile(model)  # PyTorch 2.0
 
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()
 
# Training
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
 
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)
 
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
 
        if (i + 1) % 50 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Step [{i+1}], Loss: {loss.item():.4f}")
 
print("Training complete!")

On an A100:

  • Without optimizations: ~200 samples/sec
  • With mixed precision: ~400 samples/sec (2x)
  • Add torch.compile: ~500+ samples/sec (2.5x total)
  • Distribute across 4 GPUs: ~1500+ samples/sec (7-8x with DDP overhead)

Real numbers. Real impact. The numbers compound because the techniques address different bottlenecks, mixed precision increases GPU throughput, torch.compile reduces kernel launch overhead, and DDP multiplies available hardware. You're not trading off against one technique to enable another; they stack.

The Path Forward

GPU training at scale isn't magic. It's engineering.

Start small:

  1. Profile memory (understand your bottleneck)
  2. Enable mixed precision (easy 1.5-2x)
  3. Increase batch size (squeeze GPU utilization)
  4. Add gradient accumulation (bigger batches, same memory)
  5. Use torch.compile (free speedup on modern PyTorch)
  6. Go distributed (when one GPU isn't enough)

Each step compounds. You're not chasing 10x speedups, you're chasing steady 2x improvements that multiply.

That ResNet training? 2 hours becomes 15 minutes. Your transformer? 5 days becomes 12 hours.

That's the real-world impact of understanding GPU fundamentals.

Where This Takes You

The techniques in this article represent the current standard practice for training deep learning models at any serious scale. They're not experimental, they're what engineers at Google, Meta, and every major AI lab use as their baseline before considering anything more exotic. The fact that PyTorch has made them this accessible is genuinely remarkable; five years ago, implementing mixed precision or DDP correctly required deep CUDA expertise and custom C++ code.

What comes next, once you've internalized this toolkit, is model evaluation and interpretability, understanding not just whether your model trained successfully, but whether it learned the right things, where it fails, and why. A model that trains in 12 hours instead of 5 days is only valuable if you can trust its outputs, understand its failure modes, and explain its decisions. The speed gains you've achieved here buy you the iteration budget to actually do that interpretability work rigorously, running ablations and diagnostic experiments that would have been prohibitively expensive before.

The deeper principle is that engineering and science reinforce each other in ML. Faster training isn't just about saving money or time, it's about compressing the feedback loop that separates hypotheses from evidence. Every optimization technique you master here is really an investment in your ability to learn faster, iterate more boldly, and build systems you actually understand. That's the return on this investment, and it compounds for the entire life of your career in this field.
