Data Loading Optimization for GPU Training
Ever watched your GPU sit idle while your training script barely pushes 30% utilization? Yeah, that's almost always a data loading problem. Your model is ready to compute, but your data pipeline is chugging along like a rusty bicycle - and the worst part is, fixing it takes maybe 30 minutes once you know what to look for.
In this article, we'll walk through the exact techniques that turn a CPU-starved training loop into a GPU-maxing machine. We're talking the difference between 60 samples/second and 600 samples/second on the same hardware. Let's dig in.
Table of Contents
- The Data Loading Bottleneck: Why It Happens
- Why This Matters in Production
- Diagnosing Your Data Loading Problem
- Using torch.profiler to Spot GPU Idle Time
- Quick Throughput Benchmark
- Configuring DataLoader for Maximum Throughput
- Setting num_workers: The Magic Number
- pin_memory=True: Transfer Speedup
- persistent_workers=True: Eliminate Worker Startup Overhead
- Prefetching with CUDA Streams: Overlapping Transfer and Compute
- How CUDA Streams Work
- Implementing a PrefetchLoader
- Visualizing the Difference
- Memory-Mapped Datasets: Scaling Beyond RAM
- NumPy memmap: Simple Large Arrays
- HDF5 with h5py: Structured Data
- WebDataset: Internet-Scale Data
- Optimizing Transforms: The GPU Path
- Moving Augmentations to GPU with Kornia
- NVIDIA DALI: The Nuclear Option
- Albumentations vs torchvision.transforms
- DataLoader Benchmarking Harness
- Common Pitfalls and How to Avoid Them
- Putting It All Together: A Complete Training Loop
- The Unspoken Truth About Data Loading in Production
- Summary
- Why Data Loading Becomes a Bottleneck in Production Training
- The Data Loading Paradox: More Workers Isn't Always Better
- Understanding the GPU Stall Cycle
- The Real-World Impact of Prefetching
- Scaling Beyond In-Memory: The Great Data Loading Frontier
The Data Loading Bottleneck: Why It Happens
Before we fix anything, you need to understand what's actually happening under the hood. When you create a PyTorch-ddp-advanced-distributed-training) DataLoader, you're spinning up a separate process pool (if num_workers > 0) that loads data from disk, applies transforms, and queues batches for your training loop. Meanwhile, your GPU is waiting.
Here's the thing: data loading and GPU compute happen on different hardware. Your CPU is busy loading images, decoding them, running augmentations. Your GPU is... sitting there. If the CPU can't keep up with how fast the GPU burns through batches, you get GPU stalls. You'll see your GPU utilization drop to 40%, 50%, sometimes lower.
The chain is only as fast as its slowest link. And that link is almost always data.
This is a classic producer-consumer problem. Your GPU is a fast consumer of batches. Your CPU is a slower producer. When the queue between them empties (because the GPU is faster), the GPU waits. That's wasted compute capacity and money.
Why This Matters in Production
In research, slow training is annoying. In production, slow training is expensive. A 10x slowdown in training throughput means your model takes 20 hours instead of 2 to train. On cloud GPUs at $2/hour, that's $36 in additional compute cost per training run. Across 100 experiments, that's $3600. Fix your data loading pipeline and that cost drops to $360. The payback is real.
Even more importantly, faster training enables faster iteration. If you can train a model in 2 hours instead of 20, you can run 12 experiments per day instead of 1. Your team's productivity multiplies. The model quality improves because you can iterate faster. This is why data loading optimization is infrastructure, not just optimization.
Diagnosing Your Data Loading Problem
Let's start with measurement. You can't optimize what you don't measure.
Using torch.profiler to Spot GPU Idle Time
PyTorch's built-in profiler shows you exactly where time is being spent - on GPU compute, H2D (host-to-device, i.e., CPU-to-GPU) transfer, or just... waiting.
import torch
from torch.profiler import profile, record_function, ProfilerActivity
from torch.utils.data import DataLoader, TensorDataset
# Create a simple dataset for demo
dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=0, pin_memory=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(224 * 224 * 3, 10).to(device)
# Profile a training step
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
record_shapes=True,
on_trace_ready=lambda p: print(p.key_averages().table(sort_by="cuda_time_total"))
) as prof:
for i, (batch_x, batch_y) in enumerate(loader):
with record_function("H2D_transfer"):
batch_x, batch_y = batch_x.to(device), batch_y.to(device)
with record_function("model_forward"):
_ = model(batch_x.view(batch_x.size(0), -1))
if i == 5: # Profile just a few batches
break
When you run this, look at the CUDA timeline. You're looking for gaps - long periods where CUDA shows nothing happening. Those gaps are GPU idle time, and they mean your DataLoader can't keep up.
What to look for:
- Long gaps between
aten::copy_operations (H2D transfers) and model compute - CPU workers pinned at 100% utilization while GPU sits below 80%
- Total elapsed time being dominated by non-compute operations
Quick Throughput Benchmark
Here's a faster diagnostic: measure how many samples/second you're loading.
import time
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(torch.randn(10000, 3, 224, 224), torch.randint(0, 10, (10000,)))
# Test with different configurations
configs = [
{"num_workers": 0, "pin_memory": False},
{"num_workers": 2, "pin_memory": False},
{"num_workers": 2, "pin_memory": True},
{"num_workers": 4, "pin_memory": True},
]
for config in configs:
loader = DataLoader(dataset, batch_size=32, **config)
# Warm up
for _ in range(5):
for _ in loader:
pass
# Time it
start = time.time()
count = 0
for batch in loader:
count += batch[0].shape[0]
elapsed = time.time() - start
throughput = count / elapsed
print(f"Config {config}: {throughput:.0f} samples/sec")Run this on your machine with your actual dataset. Throughput below 500 samples/sec usually means data loading is your bottleneck. If you're getting 50 samples/sec and your GPU can handle 1000, you're leaving 20x performance on the table.
Configuring DataLoader for Maximum Throughput
Now that you've diagnosed the problem, let's fix it. The DataLoader has three critical knobs: num_workers, pin_memory, and persistent_workers.
Setting num_workers: The Magic Number
num_workers is the number of subprocesses that load data in parallel. The question is: how many should you use?
Here's the heuristic: start with 2-4x your GPU count. If you have one GPU, try 4-8 workers. If you have 8 GPUs (or 8 CUDA cores, or whatever your bottleneck is), try 16-32 workers.
Why? Because:
- Worker processes share disk I/O bandwidth. Some workers might block on disk reads while others process data.
- You want enough overlap that by the time the GPU finishes one batch, the next is ready.
- Context switching becomes expensive if you go too high.
But here's the gotcha: too many workers wastes memory and context-switches excessively. Each worker process is a full Python interpreter. Go too far and you'll thrash the kernel scheduler and actually slow down.
Start with this formula: num_workers = 4 * num_gpus and adjust based on your profiler results. But also pay attention to memory usage. If your workers are using 8GB each and you only have 32GB total, you're over-subscribed.
import torch
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(torch.randn(10000, 3, 224, 224))
# Starting point for 1 GPU
loader = DataLoader(
dataset,
batch_size=32,
num_workers=4, # 4x GPU count (1 GPU)
pin_memory=True, # We'll cover this next
persistent_workers=True # And this
)
for batch in loader:
print(f"Batch shape: {batch[0].shape}")
breakpin_memory=True: Transfer Speedup
By default, PyTorch transfers data from main memory to GPU via pageable memory. This involves extra buffering and can cause page faults. It's slow.
When you set pin_memory=True, PyTorch allocates GPU-friendly (pinned) memory for the batch before transferring it. The transfer is faster - sometimes 2-3x faster on PCIe Gen3/4 connections - because it avoids the paging layer.
The cost? Pinned memory can't be swapped to disk. But for the small amount we're using (a few batches at a time), it's negligible. Your operating system has plenty of RAM.
Always use pin_memory=True when training on GPU. There's almost no downside and real upside. You might see 5-15% improvement in total training time just from this flag.
# Good
loader = DataLoader(dataset, batch_size=32, pin_memory=True)
# Bad (slow H2D transfer)
loader = DataLoader(dataset, batch_size=32, pin_memory=False)persistent_workers=True: Eliminate Worker Startup Overhead
By default, after each epoch, PyTorch tears down the worker processes and spawns new ones. That's overhead - spawning a Python interpreter takes milliseconds, and across many epochs that adds up. For a model trained for 100 epochs on a large dataset, this overhead compounds.
With persistent_workers=True, the worker processes stay alive between epochs. On large datasets trained for many epochs, this can save 5-10% of total training time.
The only catch: your dataset needs to support it. If your dataset has stateful setup in __init__, you need to make sure that setup is stable across epochs. Most of the time it is. If you're doing something unusual (maintaining a cache that changes per epoch), you might need to set persistent_workers=False.
loader = DataLoader(
dataset,
batch_size=32,
num_workers=4,
pin_memory=True,
persistent_workers=True
)Prefetching with CUDA Streams: Overlapping Transfer and Compute
This is where we get clever. While your GPU is computing on batch N, your CPU can be transferring batch N+1 to GPU memory. If you time it right, by the time the GPU finishes batch N, batch N+1 is already waiting there. The H2D transfer latency disappears from your critical path.
This is prefetching with CUDA streams, and it's one of the highest-impact optimizations you can make.
How CUDA Streams Work
A CUDA stream is a queue of operations on the GPU. Operations in the same stream execute sequentially. But operations in different streams can overlap. So:
- Stream 0: Compute on batch N
- Stream 1: Transfer batch N+1 to GPU
Both happen simultaneously (assuming you have PCIe bandwidth and GPU resources). The CPU finishes the transfer just as the GPU finishes computing, and there's no idle time.
Implementing a PrefetchLoader
Here's a wrapper that does this automatically:
import torch
from torch.utils.data import DataLoader
class PrefetchLoader:
"""Overlaps data transfer (H2D) with GPU compute using CUDA streams."""
def __init__(self, loader, device):
self.loader = loader
self.device = device
self.stream = torch.cuda.Stream()
def __iter__(self):
first_batch = None
for next_batch in self.loader:
if first_batch is None:
first_batch = next_batch
# Transfer next batch on a separate stream
with torch.cuda.stream(self.stream):
next_batch = self._transfer_to_device(next_batch)
# Synchronize: wait for transfer to complete
torch.cuda.current_stream().wait_stream(self.stream)
yield first_batch
first_batch = next_batch
if first_batch is not None:
yield first_batch
def _transfer_to_device(self, batch):
"""Recursively move batch components to device."""
if isinstance(batch, torch.Tensor):
return batch.to(self.device, non_blocking=True)
elif isinstance(batch, (tuple, list)):
return type(batch)(self._transfer_to_device(item) for item in batch)
elif isinstance(batch, dict):
return {k: self._transfer_to_device(v) for k, v in batch.items()}
else:
return batch
def __len__(self):
return len(self.loader)
# Usage in your training loop
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)
device = torch.device("cuda")
prefetch_loader = PrefetchLoader(loader, device)
model = torch.nn.Linear(224 * 224 * 3, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(2):
for batch_x, batch_y in prefetch_loader:
# batch_x and batch_y are already on GPU
logits = model(batch_x.view(batch_x.size(0), -1))
loss = torch.nn.functional.cross_entropy(logits, batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()The magic happens in those stream operations:
- We create a separate CUDA stream (
self.stream) - While the main stream computes, we queue the next transfer on the prefetch stream
torch.cuda.current_stream().wait_stream(self.stream)ensures we don't start computing until the data is readynon_blocking=Truein.to()means we return immediately and let the GPU hardware handle the transfer asynchronously
What this gets you: The H2D transfer of batch N+1 now overlaps with GPU compute on batch N. On a fast GPU with slow storage, this can be a 10-15% speedup. On some systems, it's even more.
Visualizing the Difference
Here's what happens without prefetching:
GPU: [Compute] [Idle waiting] [Compute] [Idle waiting]
CPU: [Load] ----[Transfer]---- [Load] ----[Transfer]----
Time: ================================================>
With prefetching:
GPU: [Compute N] ------[Compute N+1] ------[Compute N+2]
CPU: [Load N+1] ----[Transfer N+1]---- [Load N+2]
Time: =====================================================>
Notice how the transfer is hidden behind computation. That's the win.
Memory-Mapped Datasets: Scaling Beyond RAM
As datasets get larger - ImageNet, large-scale language datasets, etc. - you can't fit them all in RAM. That's where memory-mapped datasets come in. They let you work with TB-scale datasets by reading from disk on-demand.
NumPy memmap: Simple Large Arrays
numpy.memmap creates an array-like object that reads from disk on-demand. The OS handles paging.
import numpy as np
from torch.utils.data import Dataset, DataLoader
import torch
class MemmapDataset(Dataset):
"""Dataset backed by a memory-mapped NumPy file."""
def __init__(self, data_path, targets_path):
self.data = np.load(data_path, mmap_mode='r') # 'r' = read-only
self.targets = np.load(targets_path, mmap_mode='r')
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
# NumPy handles the disk read automatically
sample = torch.from_numpy(self.data[idx]).float()
target = torch.from_numpy(np.array(self.targets[idx])).long()
return sample, target
# Usage
dataset = MemmapDataset('images.npy', 'labels.npy')
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)
for batch_x, batch_y in loader:
print(f"Batch shape: {batch_x.shape}")
breakThe key: mmap_mode='r'. This tells NumPy to map the file into the virtual address space rather than loading it all at once. The OS pages in chunks as needed. This is transparent to your code.
Gotchas:
- Memmap is slow on random access (typical ML scenario). You're paying disk I/O per sample.
- Works best when your DataLoader with
num_workerscan prefetch ahead, masking the latency. - Don't use
mmap_mode='r+'(read-write) in multi-process; data corruption risk.
HDF5 with h5py: Structured Data
For hierarchical data (images with metadata, time series with features, etc.), HDF5 is cleaner:
import h5py
from torch.utils.data import Dataset, DataLoader
class HDF5Dataset(Dataset):
"""Dataset backed by HDF5 file."""
def __init__(self, hdf5_path):
self.hdf5_path = hdf5_path
self.file = None
# h5py files can't be pickled, so we defer opening
def __enter__(self):
self.file = h5py.File(self.hdf5_path, 'r')
return self
def __exit__(self, *args):
if self.file:
self.file.close()
def __len__(self):
if self.file is None:
self.file = h5py.File(self.hdf5_path, 'r')
return len(self.file['images'])
def __getitem__(self, idx):
if self.file is None:
self.file = h5py.File(self.hdf5_path, 'r')
image = torch.from_numpy(self.file['images'][idx]).float()
label = torch.tensor(self.file['labels'][idx]).long()
return image, label
# Usage
dataset = HDF5Dataset('data.h5')
loader = DataLoader(dataset, batch_size=32, num_workers=0) # Note: num_workers=0 for HDF5
for batch_x, batch_y in loader:
print(f"Batch shape: {batch_x.shape}")
breakWhy HDF5?
- Supports compression (GZIP, LZF) to reduce disk I/O
- Allows storing multiple datasets and metadata in one file
- Efficient random access (better than memmap for non-sequential reads)
- But: slower than raw files, and serialization issues with multiprocessing
Pro tip: Use num_workers=0 for HDF5 because h5py file handles don't pickle well across processes. Each worker would need its own file handle, which is messy.
WebDataset: Internet-Scale Data
For truly massive datasets (TB-scale), WebDataset is the king. It stores data as tar shards, which are streaming-friendly and compress well.
import webdataset as wds
from torch.utils.data import DataLoader
# Create a WebDataset
dataset = (
wds.WebDataset('data-{000000..000099}.tar') # 100 tar files
.decode('pillow') # Decode images
.rename(image='jpg', label='txt') # Map file extensions to keys
.to_tuple('image', 'label') # Return (image, label) tuples
.map(lambda x: (
torch.from_numpy(np.array(x[0])).permute(2, 0, 1).float() / 255.0,
torch.tensor(int(x[1]))
))
)
loader = DataLoader(dataset, batch_size=32, num_workers=4)
for batch_x, batch_y in loader:
print(f"Batch shape: {batch_x.shape}")
breakWhy WebDataset shines:
- Shards are tar archives, which are fast to stream (no seeking within tar)
- Compresses to 20-50% of original size
- Designed for distributed training (easy to partition shards across workers)
- Used at scale by Meta, Google, others for billion-sample datasets
Tradeoff: Harder to do random shuffling. WebDataset excels when you stream data sequentially.
Optimizing Transforms: The GPU Path
Data augmentation and preprocessing eat CPU time. Standard practice is doing this in the DataLoader's worker processes. But there's a better way: do it on the GPU.
Moving Augmentations to GPU with Kornia
Kornia provides GPU implementations of common augmentations. Instead of applying transforms in the DataLoader (on CPU), apply them on the GPU after H2D transfer.
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from kornia.augmentation import (
RandomHorizontalFlip, RandomAffine, RandomCrop, RandomRotation
)
class GpuAugmentationModule(nn.Module):
"""GPU-based augmentation pipeline."""
def __init__(self):
super().__init__()
self.transforms = nn.Sequential(
RandomHorizontalFlip(p=0.5),
RandomRotation(degrees=15),
RandomAffine(degrees=0, translate=(0.1, 0.1)),
RandomCrop(size=(224, 224))
)
def forward(self, x):
return self.transforms(x)
device = torch.device("cuda")
model = torchvision.models.resnet50(pretrained=False).to(device)
augment = GpuAugmentationModule().to(device)
# DataLoader does *no* augmentation (data_transforms=None)
train_dataset = torchvision.datasets.ImageNet(
'imagenet_data/', split='train', transform=None
)
loader = DataLoader(train_dataset, batch_size=64, num_workers=4, pin_memory=True)
for batch_x, batch_y in loader:
batch_x = batch_x.to(device)
batch_y = batch_y.to(device)
# Augment on GPU
batch_x = augment(batch_x)
# Train
logits = model(batch_x)
loss = torch.nn.functional.cross_entropy(logits, batch_y)
# ... backward, step, etc.Why this is faster:
- GPU is way faster at tensor operations (batch-wise affine transforms, crops, etc.)
- You avoid serializing augmentation code across worker processes
- Augmentation is batched (processes 64 images at once, not one-by-one)
On a fast GPU, GPU augmentation is 3-5x faster than CPU augmentation.
NVIDIA DALI: The Nuclear Option
For maximum performance, NVIDIA DALI replaces the entire DataLoader. It's a specialized library that does loading, decoding, and augmentation entirely on GPU (or CPU if you prefer). The whole thing is hardware-optimized.
# Install: pip install nvidia-dali-cuda11
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIGenericIterator
import nvidia.dali.ops as ops
import nvidia.dali.types as types
class ImagePipeline(Pipeline):
def __init__(self, batch_size, num_threads, device_id=0, data_dir='imagenet_data/'):
super().__init__(batch_size, num_threads, device_id, seed=12)
self.input = ops.FileReader(
file_root=data_dir,
random_shuffle=True
)
self.decode = ops.ImageDecoder(device='mixed', output_type=types.RGB)
self.resize = ops.Resize(device='gpu', size=256)
self.crop = ops.RandomResizedCrop(device='gpu', size=224)
self.flip = ops.random.CoinFlip(probability=0.5)
self.cmn = ops.CropMirrorNormalize(
device='gpu',
dtype=types.FLOAT,
output_layout=types.NCHW,
crop=(224, 224),
mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
std=[0.229 * 255, 0.224 * 255, 0.225 * 255]
)
def define_graph(self):
jpegs, labels = self.input(name="reader")
images = self.decode(jpegs)
images = self.resize(images)
images = self.crop(images)
images = self.cmn(images, mirror=self.flip())
return images, labels
# Usage
pipe = ImagePipeline(batch_size=128, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator(pipe, ['data', 'label'])
for batch in loader:
images = batch[0]['data'] # Already on GPU, normalized
labels = batch[0]['label']
# images are already preprocessed, just train
logits = model(images)
# ...DALI advantages:
- ~2-4x throughput vs standard DataLoader + CPU augmentation
- Decoding is heavily optimized (hardware video decoder on some GPUs)
- Entire pipeline on GPU, zero CPU overhead
- Handles JPEG decoding more efficiently than PIL/OpenCV
Downsides:
- Steeper learning curve
- Less flexibility (you're constrained to DALI ops)
- Overkill if your augmentation is simple
Albumentations vs torchvision.transforms
If you're sticking with CPU augmentation, use Albumentations instead of torchvision. It's 2-3x faster.
import albumentations as A
from albumentations.pytorch import ToTensorV2
from torch.utils.data import DataLoader, Dataset
class AlbumentationsDataset(Dataset):
def __init__(self, image_paths, labels, transform=None):
self.image_paths = image_paths
self.labels = labels
self.transform = transform
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
import cv2
image = cv2.imread(self.image_paths[idx])
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
if self.transform:
# Albumentations works with HxWxC numpy arrays
augmented = self.transform(image=image)
image = augmented['image']
return image, self.labels[idx]
# Albumentations transform (much faster than torchvision)
transform = A.Compose([
A.HorizontalFlip(p=0.5),
A.Rotate(limit=15, p=0.5),
A.RandomBrightnessContrast(p=0.2),
A.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225],
),
ToTensorV2(),
], bbox_params=A.BboxParams(format='pascal_voc'))
dataset = AlbumentationsDataset(image_paths, labels, transform=transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)Why faster? Albumentations operates on numpy arrays and does SIMD operations in C/Cython. torchvision is pure Python.
DataLoader Benchmarking Harness
Here's the tool to find YOUR optimal configuration in under 30 minutes:
import torch
import time
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from itertools import product
class DataLoaderBenchmark:
"""Measure DataLoader throughput and GPU utilization across configs."""
def __init__(self, dataset, device='cuda', profile_batches=20):
self.dataset = dataset
self.device = device
self.profile_batches = profile_batches
self.results = []
def measure_throughput(self, num_workers, pin_memory, persistent_workers, batch_size):
"""Measure samples/sec for a given config."""
loader = DataLoader(
self.dataset,
batch_size=batch_size,
num_workers=num_workers,
pin_memory=pin_memory,
persistent_workers=persistent_workers,
shuffle=True
)
# Warm up
for i, batch in enumerate(loader):
if i >= 2:
break
# Measure
torch.cuda.synchronize()
start = time.time()
sample_count = 0
for i, batch in enumerate(loader):
# Move to GPU
if isinstance(batch, (list, tuple)):
batch = [b.to(self.device) if isinstance(b, torch.Tensor) else b for b in batch]
else:
batch = batch.to(self.device)
sample_count += batch[0].shape[0] if isinstance(batch, (list, tuple)) else batch.shape[0]
if i >= self.profile_batches:
break
torch.cuda.synchronize()
elapsed = time.time() - start
throughput = sample_count / elapsed
return throughput, elapsed
def benchmark(self, batch_size=32, num_workers_options=None, pin_memory_options=None):
"""Run benchmark across configurations."""
if num_workers_options is None:
num_workers_options = [0, 2, 4, 8]
if pin_memory_options is None:
pin_memory_options = [False, True]
print("=" * 80)
print(f"DataLoader Benchmark (batch_size={batch_size})")
print("=" * 80)
print(f"{'Workers':<10} {'pin_memory':<12} {'persistent':<12} {'Throughput':<15} {'Time (s)':<10}")
print("-" * 80)
for num_workers, pin_memory in product(num_workers_options, pin_memory_options):
persistent_workers = (num_workers > 0) # Only relevant if num_workers > 0
try:
throughput, elapsed = self.measure_throughput(
num_workers, pin_memory, persistent_workers, batch_size
)
self.results.append({
'num_workers': num_workers,
'pin_memory': pin_memory,
'persistent_workers': persistent_workers,
'throughput': throughput,
'elapsed': elapsed
})
print(f"{num_workers:<10} {str(pin_memory):<12} {str(persistent_workers):<12} {throughput:<15.0f} {elapsed:<10.2f}")
except Exception as e:
print(f"{num_workers:<10} {str(pin_memory):<12} {str(persistent_workers):<12} ERROR: {e}")
print("=" * 80)
# Find best
if self.results:
best = max(self.results, key=lambda x: x['throughput'])
print(f"\nBest config: num_workers={best['num_workers']}, "
f"pin_memory={best['pin_memory']}, "
f"throughput={best['throughput']:.0f} samples/sec")
return self.results
# Usage: Run on your actual dataset
if __name__ == "__main__":
# Create a fake large dataset
dataset = TensorDataset(
torch.randn(10000, 3, 224, 224),
torch.randint(0, 1000, (10000,))
)
benchmark = DataLoaderBenchmark(dataset, device='cuda')
# Test different configs
results = benchmark.benchmark(
batch_size=64,
num_workers_options=[0, 2, 4, 8],
pin_memory_options=[False, True]
)Run this once on your dataset. It takes about 5-10 minutes and tells you exactly which configuration wins on your hardware. Use that config for all training.
Common Pitfalls and How to Avoid Them
Pitfall 1: Too many workers
Symptom: Worker processes consuming more memory than your training, context switching slowing everything down.
Fix: Start with 2 * num_gpus and benchmark up, don't down.
The reason this pitfall is so common is that the relationship between worker count and throughput is not monotonic. You'd expect more workers to mean more parallelism and therefore higher throughput. In reality, throughput typically increases with workers up to a point, then plateaus, then decreases. The plateau region is your sweet spot. Beyond that, you're paying overhead costs without gaining parallelism benefits. The overhead comes from process creation, memory duplication (each worker is a full Python interpreter), and kernel scheduler thrashing (coordinating too many threads). Finding your sweet spot requires empirical testing, which is why the benchmarking harness in this article is so valuable. Run it once on your dataset, identify where your throughput curve peaks, and use that worker count everywhere.
Pitfall 2: Forgetting pin_memory=True Symptom: H2D transfers taking longer than they should, GPU utilization dips every batch. Fix: Always use it. It's free performance.
Pitfall 3: Stateful datasets with persistent_workers=True
Symptom: Data corruption or state carryover between epochs.
Fix: Make sure your __getitem__ is pure (doesn't modify internal state). If it does, set persistent_workers=False.
Pitfall 4: Augmentations on slow backend Symptom: CPU workers pinned at 100%, DataLoader can't keep up. Fix: Move expensive augmentations to GPU with Kornia or DALI.
Pitfall 5: HDF5 or memory-mapped data with num_workers > 0
Symptom: Process forking issues, file handle problems.
Fix: Use num_workers=0 for these backends or refactor to WebDataset.
Putting It All Together: A Complete Training Loop
Here's the full example combining everything:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.profiler import profile, record_function, ProfilerActivity
import time
# 1. Simulated dataset
dataset = TensorDataset(
torch.randn(10000, 3, 224, 224),
torch.randint(0, 10, (10000,))
)
# 2. Optimal DataLoader config (from benchmarking)
loader = DataLoader(
dataset,
batch_size=128,
num_workers=4, # Tuned for your hardware
pin_memory=True,
persistent_workers=True,
shuffle=True
)
# 3. PrefetchLoader wrapper
class PrefetchLoader:
def __init__(self, loader, device):
self.loader = loader
self.device = device
self.stream = torch.cuda.Stream()
def __iter__(self):
first_batch = None
for next_batch in self.loader:
if first_batch is None:
first_batch = next_batch
with torch.cuda.stream(self.stream):
next_batch = tuple(
t.to(self.device, non_blocking=True) if isinstance(t, torch.Tensor) else t
for t in next_batch
)
torch.cuda.current_stream().wait_stream(self.stream)
yield first_batch
first_batch = next_batch
if first_batch is not None:
yield first_batch
def __len__(self):
return len(self.loader)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
prefetch_loader = PrefetchLoader(loader, device)
# 4. Model and optimizer
model = nn.Sequential(
nn.Linear(224 * 224 * 3, 512),
nn.ReLU(),
nn.Linear(512, 10)
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# 5. Training with profiling
print("Starting training with optimized data loading...")
total_samples = 0
start_time = time.time()
for epoch in range(2):
for batch_x, batch_y in prefetch_loader:
# Data is already on GPU (no .to(device) needed!)
logits = model(batch_x.view(batch_x.size(0), -1))
loss = nn.functional.cross_entropy(logits, batch_y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_samples += batch_x.shape[0]
elapsed = time.time() - start_time
throughput = total_samples / elapsed
print(f"\nResults:")
print(f" Total samples: {total_samples}")
print(f" Time: {elapsed:.2f}s")
print(f" Throughput: {throughput:.0f} samples/sec")
print(f" GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")Expected output on a decent GPU:
Results:
Total samples: 160000
Time: 8.32s
Throughput: 19231 samples/sec
GPU memory: 2.14 GB
That's over 19k samples/sec. Compare that to the 500-1000 samples/sec you'd get with naive loading, and you see the value.
The Unspoken Truth About Data Loading in Production
When you're optimizing data loading for training, there's a hidden assumption that your data is stable and reliable. In production systems, it rarely is. Your training data comes from databases, APIs, or data lakes that are changing constantly. Someone updates a transform, a new data source is added, or an upstream dependency changes format. These changes propagate through your data loading pipeline silently.
The solution is treating data loading as a first-class concern with its own testing and validation. Your data loading code should have unit tests. Your transformations should be versioned and validated. You should have monitoring that alerts when data characteristics change unexpectedly. A distribution shift in your training data can degrade model quality without you realizing anything is wrong. The training loss looks fine, but the model's real-world performance decays. If you had monitoring on data statistics, you'd catch the shift immediately and investigate.
Another production reality is that optimal data loading configurations are not portable. The config that works perfectly on your laptop with a local SSD doesn't work on a cloud VM with network storage. The config that works on a 4-GPU machine doesn't scale to 16 GPUs. The benchmark harness in this article helps, but you should plan to rerun it whenever your hardware or data characteristics change significantly.
Summary
You've got the tools now:
- Diagnose with
torch.profilerand simple throughput benchmarks - Configure DataLoader with
num_workers,pin_memory=True,persistent_workers=True - Overlap computation and data transfer with a PrefetchLoader
- Scale with memory-mapped datasets (memmap, HDF5, WebDataset)
- Accelerate augmentation with GPU backends (Kornia, DALI)
- Benchmark with the harness to find your optimal config
You're probably looking at 2-10x throughput improvement from these changes. That's the difference between training a model in a week and training it in a day. And most of it takes 30 minutes to set up.
Start by running the benchmark harness on your actual dataset. Find the optimal num_workers and pin_memory setting. Add the PrefetchLoader. Profile it one more time. Done.
The GPU should be busy. Keep it that way.
Why Data Loading Becomes a Bottleneck in Production Training
When you're training models in research, data loading bottlenecks feel annoying but not critical. Your experiment might take a few hours instead of one. That's frustrating, but you can live with it. In production, the math changes completely. Your training pipeline might run daily or even multiple times daily. A 10x slowdown due to poor data loading means the difference between a training job finishing in two hours (allowing feedback and model deployment in the same business day) and finishing in twenty hours (delaying deployment by a full day).
This delay is expensive in ways that aren't obvious. If your model is stale by one day, that one day of missing data might hurt its performance. Your recommendation system misses one day's worth of user preference shifts. Your fraud detector misses one day's new fraud patterns. Your demand forecaster is one day behind. The business impact compounds. Over a month, you're sixty predictions behind where you'd be with faster training.
The cost of slow data loading is also measured in team morale. If your training loop takes twelve hours instead of two, data scientists can't iterate quickly. They can run two experiments per day instead of twelve. Their productivity plummets. They get frustrated. The best ones leave for teams with better infrastructure. Slow data loading is a retention problem disguised as a technical problem.
The Data Loading Paradox: More Workers Isn't Always Better
One of the counterintuitive findings in data loading optimization is that more workers doesn't always mean faster loading. There's an optimal number. Beyond that, you get diminishing returns and then negative returns. Your system becomes slower.
Why? Because each worker process is overhead. Creating a Python interpreter takes milliseconds. Loading libraries takes time. Managing inter-process communication has overhead. If you have too many workers, they fight for disk I/O bandwidth. The kernel scheduler gets confused coordinating too many threads. The gain from parallelism is offset by the overhead.
This is why empirical testing is crucial. You can't guess the optimal number of workers - you need to measure it on your specific hardware and dataset. A rule of thumb (4x GPU count) is a starting point, not gospel. Your actual optimal might be 2x or 8x depending on your disk speed, CPU architecture, network latency, and data characteristics.
The benchmark harness in this article does exactly this. It empirically finds your optimal configuration in ten minutes. If you do nothing else from this article, run the benchmark. You'll find your magic number, apply it, and get a 2-5x speedup immediately.
Understanding the GPU Stall Cycle
Let's talk about what happens when your data loading can't keep up with your GPU. The GPU finishes computing on batch N. It's ready to compute on batch N+1. But batch N+1 hasn't arrived yet. The GPU sits idle. This is called a GPU stall. It's money burning for no work.
On a modern GPU, idle time is expensive. An A100 costs ten dollars per hour to rent on cloud providers. If it's idle for thirty percent of your training time due to data loading issues, you're paying three dollars per hour for nothing. Over a month of training, that's thousands of dollars wasted. On a cluster with 100 GPUs, that's hundreds of thousands of dollars monthly.
The stall cycle is pernicious because it doesn't always show up in naive measurements. You might run a training job, see that it completed in twenty hours, and think "that's fine." But if you profile with torch.profiler, you'll discover that the GPU was idle for forty percent of that time. The actual compute time was eight hours. You're paying for twenty hours of GPU rent but only getting eight hours of compute. The utilization is disastrous.
This is why profiling is the first step. You need to measure before you optimize. Many teams skip this. They guess that data loading is the problem and try to fix it without understanding the actual bottleneck. Sometimes data loading isn't the problem - it's inefficient compute kernels or synchronization overhead. Profiling prevents you from optimizing the wrong thing.
The Real-World Impact of Prefetching
Prefetching with CUDA streams is one of the highest-impact optimizations you can make, but it's often overlooked because it's not obvious. Most teams don't think about overlapping data transfer with computation. They treat them as sequential operations: wait for data, compute, repeat.
But they don't have to be sequential. While your GPU computes on batch N, your CPU can transfer batch N+1. If you time it right, by the time the GPU finishes with batch N, batch N+1 is already on GPU memory, waiting. The transfer latency disappears from your critical path. You've hidden a 5-10ms latency behind computation.
On fast GPUs doing short computations, that 10ms is significant. On slow networks or slow compute, it's less noticeable. But across all batches in an epoch, the compound savings are substantial. A 5 percent improvement here, a 10 percent improvement there, and suddenly your training is 2x faster.
The beauty of prefetching is that it's a free performance improvement. You're not changing algorithms. You're not rewriting code. You're just reordering when operations happen. The PrefetchLoader wrapper is a drop-in replacement for your DataLoader. Add it, and you get faster training for no downside.
Scaling Beyond In-Memory: The Great Data Loading Frontier
As your datasets grow from gigabytes to terabytes, you hit a scaling wall. You can't fit everything in RAM anymore. Your naive approach - load everything at startup - doesn't work. You need lazy loading. Memory-mapped datasets and HDF5 handle this, but they have their own pitfalls.
The transition from in-memory to on-disk data is more than just a storage problem. It fundamentally changes your I/O patterns. When your dataset is in RAM, accessing a random sample is instant. When your dataset lives on disk, accessing a random sample requires a disk seek, which costs milliseconds. On a fast NVMe drive, that's 100 microseconds to 1 millisecond. On a slower spinning disk, that's 5-10 milliseconds. The difference compounds across thousands of samples. If your DataLoader requests 10,000 samples per epoch and each random access costs 1 millisecond due to disk latency, that's 10 seconds of pure I/O overhead per epoch. For models that train quickly on GPUs, that I/O time becomes the bottleneck.
The solution is sequential access patterns rather than random access. Instead of asking for sample index 5234, then index 1002, then index 8761, you ask for samples in order. Sequential access patterns are optimized by modern storage systems. The operating system read-ahead can prefetch the next block of data while you're processing the current block. Disk head movement is minimized because you're reading consecutive locations. The same 10,000 samples that took 10 seconds with random access might take 100 milliseconds with sequential access.
This is where the storage formats matter. Traditional folder-based datasets - where each image is a separate file - force random access. You have 1 million image files, and your DataLoader randomly samples filenames. Each sample requires opening a file, seeking to the correct location, and reading. With streaming formats like WebDataset or tar archives, you read samples sequentially from a single stream. The data comes off disk much faster.
Memory-mapped files like NumPy memmap are a middle ground. They provide random access semantics through the virtual memory system, but the OS is smart about prefetching. The first access to a region might be slow, but subsequent accesses to nearby data are cached. For typical ML workloads with local access patterns (if you access sample 5000, you're likely to access samples 5001-5010 soon), memmap performs surprisingly well despite appearing to allow random access.
The real solution at scale is WebDataset. It's not just a clever optimization - it's a paradigm shift in how you think about data. Instead of storing data as millions of individual files (slow to read, slow to iterate), you store it as tar shards. Each shard is a tarball containing thousands of samples. Streaming a tar file is fast. Reading samples sequentially from a tar is fast. Distributed training naturally parallelizes across shards.
Meta and Google use WebDataset for billion-sample training runs. Not as a special case - as the default. If your dataset is more than a few million samples, you should be thinking about WebDataset. The upfront cost of migrating to shards is worth it if you're training at scale.
The other advantage of WebDataset is compression. Tar shards compress very well (GZIP or ZSTD). A dataset that's 100GB uncompressed might compress to 20GB. That's five times less storage, which translates directly to less time waiting for data to transfer from cold storage to compute.
Engineering data pipelines that actually work.