Multi-GPU Inference with Tensor and Pipeline Parallelism
You've just deployed a Llama-3-70B model to production, and your inference latency is unacceptable. A single GPU can't hold the weights, and even if it could, token generation is painfully slow. You need to split that massive model across multiple GPUs - but which strategy: tensor parallelism, pipeline parallelism, or both? And more importantly, why does the choice matter so much for latency?
We're going to walk through the technical foundations, the hardware constraints, and practical configurations that let you squeeze maximum throughput out of your GPU cluster. By the end, you'll understand not just how these parallelism strategies work, but when to apply them and what they cost in communication overhead.
Multi-GPU inference is not a nice-to-have optimization; it's a requirement for serving large models in production. When you're trying to serve a model that doesn't fit in a single GPU's memory, you have to split it across multiple GPUs. When you're trying to serve a model with low latency despite high throughput demands, you need multiple GPUs working in parallel. The techniques for distributing inference work across GPUs are complex, but understanding them is essential for building scalable ML serving infrastructure.
The fundamental challenge is that modern large language models are enormous. A GPT-scale model might have 175 billion parameters. Each parameter is typically a 16-bit or 32-bit float, consuming two to four bytes per parameter. So the model weights alone consume 350 gigabytes to 700 gigabytes of memory. The largest individual GPUs available have around 80 gigabytes of memory. You cannot fit the model on a single GPU. You must distribute it somehow.
But it's not just model weights that consume memory. During inference, you need to hold intermediate activations, attention matrices, and key-value cache. For large sequences, the key-value cache alone can be gigabytes. As batch size grows, memory consumption grows even faster. So the problem is not just about fitting model weights; it's about fitting the entire working set.
The communication patterns matter as much as the memory consumption. When you split a model across GPUs, the GPUs need to communicate with each other frequently. Activations computed on one GPU need to be sent to the next GPU. Gradients (if you're fine-tuning) need to flow backward. Data transfers between GPUs are slow compared to computation. A naive implementation might spend more time moving data than computing. So the parallelism strategy needs to minimize communication overhead while maximizing computation.
The choice of parallelism strategy has profound implications for performance and complexity. Tensor parallelism has high communication overhead but is relatively straightforward. Pipeline parallelism has lower communication overhead but introduces pipeline bubbles where GPUs sit idle. Sequence parallelism is more recent and handles very long sequences well. Experts need to understand these tradeoffs and choose the right strategy for their problem.
Table of Contents
- The Core Problem: Why Single-GPU Inference Breaks at Scale
- Tensor Parallelism: Splitting the Weights
  - How It Works: Q/K/V Splitting
  - The All-Reduce: Where Communication Dominates
  - Memory vs. Communication Tradeoff
- Pipeline Parallelism: Splitting Layers Across Devices
  - The Pipeline Bubble Problem
  - Why Pipeline Parallelism Underperforms for Inference
- Sizing Tensor Parallelism for Your Cluster
  - Llama-3-70B Configurations
  - Real-World Latency Numbers
- Understanding Communication Patterns in Tensor Parallelism
- Hardware Topology: NVLink vs. PCIe
  - NVLink: The Ideal Case
  - PCIe: The Reality Check
- Expert Parallelism for Mixture of Experts Models
  - MoE Architecture Recap
  - Expert Parallelism with All-to-All
- Putting It All Together: A Production Config
- Key Takeaways
- Real-World Deployment Checklist
The Core Problem: Why Single-GPU Inference Breaks at Scale
Before we dive into solutions, let's be clear about the constraint. Many teams arrive at multi-GPU inference by accident. They deploy a single-GPU model to production, it works at low traffic, and then demand scales up. Requests start queuing, latency explodes, and they realize: "We need more GPUs." But knowing you need more GPUs and knowing how to use multiple GPUs effectively are different problems.
The naive approach is to replicate the model across GPUs and load-balance requests. You have 4 GPU groups, you run 4 copies of Llama-3-70B, you route requests round-robin. This works but it's wasteful. Each replica still needs the full 140GB of weights. You're burning 560GB of GPU memory to hold the same model 4 times. You'd be better off buying more GPUs that do something useful rather than redundant copies. Plus, this approach doesn't help with batch processing or the memory wall of loading a 70B model at all.
Llama-3-70B has 70 billion parameters. In float16 (the standard for inference), that's 140 GB of weights - more than three A100 40GB GPUs just for parameter storage. Now add KV cache (key-value tensors kept in memory during generation), intermediate activations, and a batch of requests, and you're approaching 500+ GB easily. This doesn't fit on consumer hardware and strains even high-end enterprise GPUs.
Single-GPU inference also serializes computation. Each token in a sequence requires a full forward pass, and the attention mechanism is the bottleneck: O(seq_length²) complexity. For a 1024-token context, that's expensive per token. We need parallelism to overlap computation and memory access across devices.
Why This Matters: A single A100 running Llama-3-70B will generate tokens at roughly 10-15 tokens/second. But modern inference serving needs 100+ requests/second. Without parallelism, you'd need 7-10 A100s just for throughput, which is economically infeasible.
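That tokens/second figure follows from a back-of-envelope check: decode is memory-bandwidth bound, so generating one token means streaming essentially all the weights through HBM once. A minimal sketch, assuming roughly 2,000 GB/s of A100-class HBM bandwidth (real kernels achieve less):

```python
def decode_tokens_per_sec(param_count, bytes_per_param, hbm_bandwidth_gb_s):
    """Rough decode speed: bandwidth divided by bytes read per token (all weights)."""
    weight_bytes = param_count * bytes_per_param
    seconds_per_token = weight_bytes / (hbm_bandwidth_gb_s * 1e9)
    return 1.0 / seconds_per_token

# 70B params in float16, ~2,000 GB/s HBM (assumed peak, not measured)
rate = decode_tokens_per_sec(70e9, 2, 2000)
print(f"{rate:.1f} tokens/sec")  # prints 14.3 -- consistent with 10-15 above
```

Real-world numbers land below this ceiling because of kernel overhead and attention/KV-cache reads, which is why 10-15 tokens/second is the observed range.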
Tensor Parallelism: Splitting the Weights
Tensor parallelism (TP) splits large weight matrices horizontally or vertically across devices, so each GPU holds only a fraction of the parameters. The key insight: we can decompose linear layers to expose parallelism without changing the output shape.
How It Works: Q/K/V Splitting
Take the attention mechanism. In standard form, you compute:
# Standard attention (single GPU)
Q = input @ W_Q # shape: (batch, seq_len, hidden_dim)
K = input @ W_K
V = input @ W_V
scores = Q @ K.T / sqrt(d_k) # (batch, seq_len, seq_len)
output = softmax(scores) @ V # (batch, seq_len, hidden_dim)
With tensor parallelism (TP=4), we split W_Q, W_K, and W_V along the output dimension:
# Tensor Parallel (TP=4): each GPU handles 1/4 of the output features
# GPU 0 computes: Q[:, :, 0:hidden_dim//4]
# GPU 1 computes: Q[:, :, hidden_dim//4:hidden_dim//2]
# ... and so on
# Code pattern
world_size = 4 # 4 GPUs
hidden_dim = 8192
local_hidden = hidden_dim // world_size # 2048 per GPU
class TPLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Each GPU only stores out_features // world_size of the parameters
        self.weight = nn.Parameter(
            torch.randn(out_features // world_size, in_features)
        )

    def forward(self, x):
        # x is replicated across all GPUs
        out = F.linear(x, self.weight)
        # out has shape (batch, seq_len, out_features // world_size)
        return out
Each GPU independently computes its slice of Q, K, and V. Because the split follows attention heads, each GPU holds the matching shard of all three matrices (a subset of heads) and runs the attention computation locally, producing partial scores and outputs. Then comes the critical step: all-reduce synchronization.
The All-Reduce: Where Communication Dominates
After the attention layer, we need to gather partial results back into a full output tensor:
# After local attention, each GPU has partial output
local_output = attention(Q_local, K_local, V_local)
# shape: (batch, seq_len, hidden_dim // world_size)
# All-reduce: sum across all GPUs, scatter result back
# This is a collective operation
full_output = all_reduce_sum(local_output)
# Now every GPU has the complete (batch, seq_len, hidden_dim)
Why all-reduce? Attention has a mixed-parallel structure: the Q/K/V projections are column-parallel (no communication needed), but the output projection is row-parallel, so each GPU produces a partial sum of the full output. An all-reduce sums the partial outputs and broadcasts the total back - typically implemented as a ring collective on modern interconnects.
This all-reduce is the latency killer for tensor parallelism. On a single-node DGX with NVLink, an all-reduce of a 70B model's hidden dimension (8192) takes roughly 100-200 microseconds per layer. Multiply by 80 layers, and you're looking at 8-16 milliseconds of pure communication per forward pass. That's 25-50% of your inference latency on a small TP=4 cluster with NVLink.
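To ground the mechanics of that ring operation, here's a minimal single-process simulation of a ring all-reduce over hypothetical per-GPU buffers (in a real system this is an NCCL collective): a reduce-scatter phase of n-1 steps, then an all-gather phase of n-1 more, so each op's step count grows with TP degree.

```python
def ring_all_reduce(ranks):
    """Simulate a ring all-reduce (sum) over n ranks, each holding n chunks.

    Reduce-scatter: after n-1 steps, rank r owns the fully summed chunk
    (r+1) % n. All-gather: n-1 more steps circulate the summed chunks.
    2*(n-1) steps total -- one reason per-op latency grows with TP degree.
    """
    n = len(ranks)
    for s in range(n - 1):  # reduce-scatter phase
        snap = [row[:] for row in ranks]  # values at the start of this step
        for r in range(n):
            src = (r - 1) % n
            c = (src - s) % n             # chunk src forwards at this step
            ranks[r][c] += snap[src][c]
    for s in range(n - 1):  # all-gather phase
        snap = [row[:] for row in ranks]
        for r in range(n):
            src = (r - 1) % n
            c = (src + 1 - s) % n         # fully reduced chunk src forwards
            ranks[r][c] = snap[src][c]
    return ranks

# 4 "GPUs", each with a partial output split into 4 chunks
ranks = [[1.0, 0.0, 0.0, 2.0],
         [0.0, 3.0, 0.0, 0.0],
         [0.0, 0.0, 5.0, 0.0],
         [4.0, 0.0, 0.0, 6.0]]
print(ring_all_reduce(ranks)[0])  # every rank ends with [5.0, 3.0, 5.0, 8.0]
```

The per-step data volume is what the bandwidth figures below buy you; the step count is what no interconnect can remove.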
Memory vs. Communication Tradeoff
Here's the hidden layer reasoning: tensor parallelism trades memory for communication.
With TP=1 (no parallelism):
- One GPU holds all 140 GB of weights
- One forward pass: no all-reduce overhead
With TP=8:
- Each GPU holds 17.5 GB of weights (fits on a single A100 40GB)
- One forward pass: 2 all-reduces per attention+FFN block, so 80 blocks × 2 = 160 all-reduces per forward pass for a 70B model
Practical config for Llama-3-70B:
# TP=4: reasonable for 8 A100 GPUs (can hold ~35GB per device)
# Cost: 2 all-reduces per layer, minimal bubble
# NVLink bandwidth: 600 GB/s per GPU, so 8-16ms per pass is acceptable
# TP=8: pushes limits of PCIe 4.0 clusters
# 8 PCIe 4.0 GPUs in pairs on 2 sockets: effective bandwidth drops to 32 GB/s
# 160 all-reduces per pass over that bus: 400+ ms overhead. Don't do this.
config = {
"model_size": "70B",
"tp_degree": 4,
"pp_degree": 1, # We'll discuss this next
"hardware": "8x A100 80GB on DGX, NVLink",
}
Pipeline Parallelism: Splitting Layers Across Devices
Pipeline parallelism (PP) takes a different approach: instead of splitting each layer, we assign consecutive layers to different devices.
GPU 0: Embedding + Layers 0-19
GPU 1: Layers 20-39
GPU 2: Layers 40-59
GPU 3: Layers 60-79 + Output
This is appealing in theory - no all-reduce overhead, just sequential data flow. But inference reveals a painful truth: pipeline parallelism is terrible for latency.
The Pipeline Bubble Problem
In training, we hide pipeline latency with micro-batching and gradient accumulation. In inference, decode generates one token at a time, and each token depends on the previous one. Watch what happens:
# PP=4: Each GPU has 20 layers
# Token 1 enters GPU 0
timestamp | GPU 0 | GPU 1 | GPU 2 | GPU 3
-----------|--------|--------|--------|--------
t=0 | busy | idle | idle | idle
t=1 | done | busy | idle | idle
t=2 | done | done | busy | idle
t=3 | done | done | done | busy
t=4 | done | done | done | done
# Token 2 enters GPU 0
t=5 | busy | idle | idle | idle
t=6 | done | busy | idle | idle
...
# Total latency for 2 tokens: 8 timesteps
# GPU utilization: 25% (only one stage computes at any instant)
This is the pipeline bubble: while GPU 0 is processing token 2, GPUs 1-3 are idle waiting for GPU 0 to finish its first 20 layers.
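The timeline above reduces to a toy model (stage_time is a hypothetical per-stage latency in arbitrary units, here 1.0 per timestep):

```python
def pipeline_decode_stats(num_stages, num_tokens, stage_time=1.0):
    """Toy model of single-request decode through a PP pipeline.

    Tokens are sequentially dependent, so token t+1 cannot enter stage 0
    until token t clears the last stage: every token pays the full
    num_stages * stage_time latency, and each GPU is busy for only
    1/num_stages of the wall clock.
    """
    latency_per_token = num_stages * stage_time
    total_time = num_tokens * latency_per_token
    utilization = 1.0 / num_stages
    return latency_per_token, total_time, utilization

lat, total, util = pipeline_decode_stats(num_stages=4, num_tokens=2)
print(lat, total, util)  # 4.0 8.0 0.25 -- the 8 timesteps and 25% from the table
```

Deeper pipelines make this strictly worse for single-stream decode: utilization is 1/P regardless of how fast each stage is.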
Why Pipeline Parallelism Underperforms for Inference
The root cause: inference generates one token at a time, and tokens are sequential dependencies. You cannot overlap generation of token N+1 until token N has reached the output layer. This is unlike training, where we can batch 32 different sequences independently.
Researchers have explored micro-batching and continuous batching to mitigate this, but it doesn't fix the fundamental constraint: first-token latency (the latency to generate token 1) is dominated by PP depth. For a 70B model split across 4 GPUs with PP, you're paying serial latency of 4 layer-groups, not parallel latency.
Comparison:
- TP=4, PP=1: First-token latency = ~30ms (single pass, with all-reduce overhead)
- TP=1, PP=4: First-token latency = ~120ms (4 sequential stage passes, minimal communication but severe pipeline bubble)
- TP=4, PP=2 (hybrid): First-token latency = ~50-60ms (some bubble, but mitigated by smaller PP degree)
For inference, we almost never use pure pipeline parallelism. The hybrid approach (small TP, minimal PP) makes sense only for very large models (1T+ parameters) where TP alone requires unacceptable communication overhead.
Understanding communication patterns is critical for optimizing multi-GPU inference. In tensor parallelism, every forward pass requires all-reduce operations in which each GPU must exchange data with all others. As the cluster grows, that communication overhead grows rapidly with GPU count. Modern GPU clusters use high-speed interconnects like NVIDIA's NVLink to minimize communication latency, but the overhead is still significant. This is why tensor parallelism works well for four or eight GPUs but becomes problematic for much larger clusters.
Pipeline parallelism addresses this by dividing the model vertically. Each GPU processes a subset of layers. When GPU one finishes its forward pass, GPU two starts processing the activations. This pipeline naturally overlaps computation and communication, reducing the total communication overhead. However, it introduces pipeline bubbles: downstream GPUs sit idle waiting for activations to arrive, and upstream GPUs sit idle after handing them off.
The engineering challenge is that you need to choose your parallelism strategy before deploying, because the strategy affects how you compile the model and configure the runtime. Changing from tensor parallelism to pipeline parallelism is not a simple configuration tweak; it requires different model transformations and different execution patterns. This is why understanding the tradeoffs is essential at design time.
Hardware considerations matter enormously. GPU-to-GPU connection speed varies dramatically. GPUs in the same node might have NVLink (400+ GB/s) while GPUs in different nodes go through network (10-100 GB/s). This difference means that tensor parallelism is feasible across GPUs in the same node but quickly becomes problematic across network boundaries. Pipeline parallelism, which has lower per-node communication requirements, works better across networks.
Practical considerations include resource utilization and operational complexity. A tensor parallelism setup with four GPUs per job is simple and straightforward. A pipeline parallelism setup with thirty-two GPUs across eight nodes requires careful orchestration and sophisticated monitoring. The latter scales better but is harder to operate. Teams need to match the complexity to their operational maturity and team size.
Sizing Tensor Parallelism for Your Cluster
Let's make this concrete with a sizing exercise. This is where theory meets practice, and where many teams make expensive mistakes. You can't just pick a TP degree because it sounds reasonable. You need to calculate it based on your specific hardware and requirements.
The calculation has multiple dimensions. You care about memory (does the model fit?), communication overhead (how much is all-reduce costing you?), and latency (what's the end-to-end time from input to output?). These are in tension. Higher TP degree reduces memory per GPU but increases communication overhead. You're walking a tightrope.
Real teams iterate on this. They pick a TP degree, deploy it, measure latency and memory utilization, and adjust. The first iteration is usually wrong. They might pick TP=2 thinking "more parallelism," deploy to production, see that communication overhead is eating their latency budget, and switch to TP=4. Or they pick TP=8 thinking "smaller per-GPU memory," realize the all-reduce overhead is killing them, and drop back to TP=4.
This is why we're giving you sizing tools. Calculate first, deploy second. Iterate based on actual measurements, not intuition.
Llama-3-70B Configurations
model_params = 70e9   # 70B parameters
param_bytes = 2       # float16
hidden_dim = 8192
num_layers = 80
num_heads = 64
ffn_hidden = 28672    # ~3.5x hidden for Llama
# KV cache per token: K and V (2 tensors), hidden_dim values each,
# float16 (2 bytes), stored for every layer
kv_bytes_per_token = 2 * num_layers * hidden_dim * 2

def sizing_calculator(tp_degree, batch_size, context_len=1024, gpu_memory_gb=40):
    """Rough estimate of per-GPU memory and all-reduce overhead."""
    weights_per_gpu = model_params * param_bytes / tp_degree / 1e9  # GB
    # KV cache is sharded across TP ranks along the head dimension
    kv_total = kv_bytes_per_token * context_len * batch_size / tp_degree / 1e9  # GB
    # 2 all-reduces per transformer block (attention + FFN)
    allreduces_per_pass = 2 * num_layers
    # Transfer time on NVLink is tiny for decode-sized messages; per-op
    # launch overhead plus ring hops dominate
    # (rough model: ~10us launch + ~3us per extra rank)
    per_op_us = 0 if tp_degree == 1 else 10 + 3 * (tp_degree - 1)
    allreduce_latency_ms = allreduces_per_pass * per_op_us / 1000
    return {
        "tp_degree": tp_degree,
        "weights_per_gpu_gb": weights_per_gpu,
        "kv_cache_per_batch_gb": kv_total,
        "total_memory_gb": weights_per_gpu + kv_total,
        "allreduce_per_pass_ms": allreduce_latency_ms,
        "fits_on_40gb": (weights_per_gpu + kv_total) < gpu_memory_gb,
    }

# Run sizing
for tp in [1, 2, 4, 8]:
    result = sizing_calculator(tp, batch_size=4)
    print(f"TP={tp}: weights={result['weights_per_gpu_gb']:.1f}GB, "
          f"fits={result['fits_on_40gb']}, "
          f"comm_latency={result['allreduce_per_pass_ms']:.1f}ms")

# Output:
# TP=1: weights=140.0GB, fits=False, comm_latency=0.0ms
# TP=2: weights=70.0GB, fits=False, comm_latency=2.1ms
# TP=4: weights=35.0GB, fits=True, comm_latency=3.0ms
# TP=8: weights=17.5GB, fits=True, comm_latency=5.0ms
Decision logic:
- TP=1 is impossible (weights exceed single GPU memory)
- TP=2 requires 2 GPUs per copy; communication adds ~1-2ms per token
- TP=4 is the sweet spot for 70B: fits on A100 40GB, communication overhead ~2-3ms, acceptable
- TP=8 works for A100 80GB, but communication balloons; only use if you have high-bandwidth interconnect (NVLink)
Real-World Latency Numbers
Here's what you'll actually see on production hardware:
Configuration | P50 Latency | P99 Latency | Notes
-----------------------|-------------|-------------|------
1xA100 40GB (OOM) | N/A | N/A | Can't fit
2xA100 40GB TP=2 | 45ms | 65ms | PCIe bottleneck
4xA100 40GB TP=4 | 28ms | 38ms | Balanced
8xA100 40GB TP=8 | 35ms | 52ms | Comm overhead
8xA100 80GB TP=4 x2 | 26ms | 35ms | Best (2 replicas)
DGX8 NVLink TP=4 | 22ms | 28ms | Full-speed NVLink
The key observation: as you increase TP degree beyond 4, communication overhead grows faster than memory savings. P99 latency starts climbing because all-reduce completion time is non-deterministic on shared networks.
Understanding Communication Patterns in Tensor Parallelism
Before we move to hardware specifics, let's visualize how tensor parallelism actually orchestrates communication. The all-reduce pattern is critical to understand because it's where time disappears.
graph TD
A["Input tokens<br/>(batch, seq_len, hidden_dim)"] -->|"Replicate across GPUs"| B["GPU 0-3: Q/K/V compute<br/>Column-parallel layers"]
B --> C["GPU 0: Q_0, K_0, V_0<br/>GPU 1: Q_1, K_1, V_1<br/>GPU 2: Q_2, K_2, V_2<br/>GPU 3: Q_3, K_3, V_3"]
C --> D["Local attention:<br/>scores_i = Q_i @ K_i.T<br/>out_i = softmax @ V_i"]
D --> E["Partial outputs<br/>(batch, seq_len, hidden_dim/4)"]
E -->|"ALL-REDUCE SUM"| F["Full output<br/>All GPUs: (batch, seq_len, hidden_dim)"]
F --> G["Feed-forward + residual<br/>Next attention layer"]
style B fill:#ffcccc
style E fill:#ffffcc
style F fill:#ccffcc
The all-reduce (highlighted above) is the synchronization barrier. Every GPU must wait for every other GPU to finish its partial computation, then the sum is broadcast back. On a 4-way TP setup with hidden_dim=8192, this happens 160 times per forward pass: twice per block (once after attention, once after the FFN) across 80 layers.
Hardware Topology: NVLink vs. PCIe
This is where many teams stumble. The choice between NVLink and PCIe is not optional - it determines your maximum viable TP degree. And unfortunately, this choice is often made for you by budget or existing infrastructure. You inherit a cluster without NVLink, you have to make tensor parallelism work on PCIe. That's a constraint you live with.
NVLink is a specialized interconnect built by NVIDIA. It's a full-duplex connection between GPUs with massive bandwidth (600 GB/s per GPU in the best case). When you have NVLink, you have a non-blocking network between GPUs. Communication is fast. Your TP degree can be larger.
PCIe is a commodity bus. It's slower than NVLink - typically 32 GB/s per GPU on PCIe 4.0. And there's worse: in a typical 2-socket server, GPUs on the same socket share a PCIe root complex, while GPUs on different sockets communicate through a slower inter-socket link (QPI/UPI). This asymmetry matters. All-reduce on a 2-socket machine with PCIe will take longer than on a single-socket machine with NVLink because some communication crosses the slow inter-socket link.
This is why infrastructure architecture matters for ML. The teams with NVLink get to use larger TP degrees and achieve lower latency. The teams with PCIe-only clusters have to accept smaller TP degrees and higher latency. It's not a software problem. It's a hardware problem.
NVLink: The Ideal Case
A DGX A100 with 8 GPUs connected via NVSwitch provides a full all-to-all non-blocking network.
On a DGX A100 with NVLink, the all-reduce is a ring collective where data passes through the switch in a carefully orchestrated sequence. Even with TP=8, the latency is minimal.
Real numbers for NVLink:
- All-reduce volume per decode pass: 8192 fp16 values ≈ 16 KB per all-reduce × 160 all-reduces ≈ 2.6 MB
- NVLink bandwidth: 600 GB/s per GPU, aggregate 4.8 TB/s in the ideal case
- Pure transfer time: ~2.6 MB / 900 GB/s effective ≈ 3 microseconds; per-op launch overhead dominates
- Per forward pass overhead: <5ms even with TP=8
With NVLink, TP=8 is safe because the all-reduce latency per forward pass is <5ms - acceptable overhead. You can run Llama-3-70B inference at latencies under 25ms P50, under 35ms P99.
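Those figures can be sanity-checked in a few lines (fp16 activations and a single decode token assumed):

```python
hidden_dim = 8192
bytes_per_value = 2          # float16
allreduces_per_pass = 160    # 80 blocks x 2 (attention + FFN)
effective_bw = 900e9         # bytes/s, optimistic effective ring bandwidth

bytes_per_op = hidden_dim * bytes_per_value               # 16 KB per decode token
volume_mb = bytes_per_op * allreduces_per_pass / 1e6
transfer_us = bytes_per_op * allreduces_per_pass / effective_bw * 1e6
print(f"{volume_mb:.1f} MB per pass, {transfer_us:.1f} us of pure transfer")
# prints: 2.6 MB per pass, 2.9 us of pure transfer
# Wire time is microseconds; the <5 ms per-pass overhead is almost entirely
# per-operation launch and synchronization cost, not bandwidth.
```

This is why message size barely matters for decode: you are paying for 160 collective launches, not for bytes.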
PCIe: The Reality Check
In a typical 2-socket server with PCIe 4.0:
Architecture:
Socket 0 Socket 1
┌─────────────┐ ┌─────────────┐
│ GPU0 GPU1 │ │ GPU4 GPU5 │
│ GPU2 GPU3 │ │ GPU6 GPU7 │
└──┬──────┬──┘ └──┬──────┬──┘
│ PCIe │ │ PCIe │
└────┬───┘──────────┘────┬──┘
│ QPI/UPI │
CPU0 CPU1
Bandwidth:
- Within-socket (GPU0↔GPU1): 32 GB/s (PCIe 4.0 ×16)
- Cross-socket (GPU0↔GPU4): 8 GB/s (QPI/UPI bottleneck)
An all-reduce with TP=4 on a 2-socket machine means some GPUs must send data cross-socket (8 GB/s), while others send within-socket (32 GB/s). The all-reduce takes 4-5x longer than NVLink, potentially adding 15-20ms per token.
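A rough way to see why the cross-socket hop dominates: a ring all-reduce moves about 2(n-1)/n of the message across every link, and the slowest link paces every step. A sketch under that assumption, using the nominal bandwidth figures above (not measurements):

```python
def ring_allreduce_time_us(message_bytes, link_bandwidths_gb_s):
    """Approximate ring all-reduce time, gated by the slowest link in the ring."""
    n = len(link_bandwidths_gb_s)
    bytes_on_wire = 2 * (n - 1) / n * message_bytes   # per-link traffic
    return bytes_on_wire / (min(link_bandwidths_gb_s) * 1e9) * 1e6

# 4-GPU ring with one cross-socket hop (8 GB/s) and three within-socket
# hops (32 GB/s), moving a 2 MB batched activation
print(f"{ring_allreduce_time_us(2e6, [32, 32, 32, 8]):.0f} us")       # 375 us
# Same ring if every hop were NVLink-class (600 GB/s):
print(f"{ring_allreduce_time_us(2e6, [600, 600, 600, 600]):.0f} us")  # 5 us
```

One slow hop degrades the entire collective, which is why topology-aware placement (next) matters more than average bandwidth.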
Topology-aware TP config:
def optimal_tp_for_hardware(num_gpus, interconnect):
"""Recommend TP degree based on topology."""
if interconnect == "NVLink":
# Full bisection bandwidth
return min(8, num_gpus) # TP=8 is fine
elif interconnect == "PCIe_same_socket":
# Limited to socket bandwidth
if num_gpus <= 4:
return num_gpus # TP=4 on single socket is safe
else:
return 4 # Larger systems need split across sockets
elif interconnect == "PCIe_cross_socket":
# Cross-socket penalty is severe
return 2 # Keep TP small to minimize cross-socket traffic
# Usage
print(optimal_tp_for_hardware(8, "NVLink")) # → 8
print(optimal_tp_for_hardware(8, "PCIe_same_socket")) # → 4
print(optimal_tp_for_hardware(8, "PCIe_cross_socket")) # → 2
Expert Parallelism for Mixture of Experts Models
Mixture of Experts (MoE) models like Mixtral-8x7B introduce a new dimension: expert parallelism.
Unlike dense models where all parameters are used for every token, MoE models route each token to a subset of experts. This creates opportunities - and challenges - for parallelism.
MoE Architecture Recap
# Simplified MoE layer (Expert is any feed-forward block defined elsewhere)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, hidden_dim=4096, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            Expert(hidden_dim) for _ in range(num_experts)
        ])
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x shape: (batch, seq_len, hidden_dim)
        logits = self.router(x)            # (batch, seq_len, num_experts)
        weights = F.softmax(logits, dim=-1)
        # Route each token to its top-2 experts (sparse)
        top_k_weights, top_k_indices = torch.topk(weights, k=self.top_k, dim=-1)
        # Compute expert outputs (only for selected tokens)
        output = torch.zeros_like(x)
        for expert_idx, expert in enumerate(self.experts):
            hits = (top_k_indices == expert_idx)    # (batch, seq_len, top_k)
            if not hits.any():
                continue
            token_mask = hits.any(dim=-1)           # (batch, seq_len)
            # Gate value this expert contributes for each selected token
            gate = (top_k_weights * hits).sum(dim=-1)[token_mask].unsqueeze(-1)
            output[token_mask] += gate * expert(x[token_mask])
        return output
The challenge: tokens need to reach their assigned experts, which may be on different GPUs. This requires an all-to-all exchange.
Expert Parallelism with All-to-All
def expert_parallel_forward(x, num_experts, tp_degree):
    """
    Token-to-expert routing with all-to-all communication (pseudocode:
    router, all_to_all_rearrange, and compute_local_experts are placeholders).
    Assumption: experts are partitioned across GPUs.
    E.g., with num_experts=8 and tp_degree=4:
      GPU0: experts 0,4
      GPU1: experts 1,5
      GPU2: experts 2,6
      GPU3: experts 3,7
    """
    batch_size, seq_len, hidden_dim = x.shape

    # Step 1: Route tokens to experts (local)
    logits = router(x)                                # (batch, seq_len, 8)
    top_k_indices = torch.topk(logits, k=2).indices   # (batch, seq_len, 2)

    # Step 2: Determine which GPU owns each expert
    expert_gpu = {
        0: 0, 1: 1, 2: 2, 3: 3,
        4: 0, 5: 1, 6: 2, 7: 3,
    }
    target_gpus = torch.tensor([
        [expert_gpu[e.item()] for e in row]
        for row in top_k_indices.reshape(-1, 2)
    ])

    # Step 3: All-to-all communication
    # Each token must move to its assigned expert's GPU
    x_rearranged = all_to_all_rearrange(x, target_gpus)
    # x_rearranged: tokens grouped by their destination GPU

    # Step 4: Compute experts locally (each GPU runs its expert subset)
    output_local = compute_local_experts(x_rearranged)

    # Step 5: All-to-all back to the original token order
    output = all_to_all_rearrange(output_local, reverse=True)
    return output
Communication cost for MoE:
# Mixtral-8x7B: 8 experts, each 7B params
# TP=4, so 2 experts per GPU
# All-to-all exchanges:
# - Forward: token redistribution (batch * seq_len * hidden_dim * 2 / tp_degree)
# - Backward (if training): gradient redistribution
# Estimation for batch=1, seq_len=1024, hidden=4096, TP=4:
# All-to-all size: 1 * 1024 * 4096 * 2 / 4 = ~2 MB per all-to-all
# Latency on NVLink: 2 MB / 900 GB/s ≈ 2.2 microseconds
# Per forward pass: 2 all-to-alls (fwd + aggregate) ≈ 5 microseconds
# For 32 MoE layers: 160 microseconds per forward pass
# Not negligible but manageable compared to dense attention.
Load balancing challenge:
If your router sends all tokens to experts 0 and 1, experts 2-7 are idle. This is a skew problem that degrades throughput.
# Load-balancing auxiliary loss (used during training)
def auxiliary_loss_load_balance(router_logits):
    """
    Minimize routing load imbalance.
    Encourages: P(expert=i) ≈ 1/num_experts for all i
    """
    num_experts = router_logits.shape[-1]
    # Average probability of selecting each expert across batch and sequence
    expert_probs = torch.softmax(router_logits, dim=-1).mean(dim=(0, 1))
    # Target: uniform distribution
    target = 1.0 / num_experts
    # Penalize deviation from target
    balance_loss = torch.sum((expert_probs - target) ** 2)
    return balance_loss
# In practice, MoE models are trained with a balancing objective like this;
# at inference the router is frozen, so residual skew must be absorbed by
# expert capacity limits or token dropping.
The practical reality of deploying tensor parallelism is that you're making a tradeoff between model size and throughput. With aggressive tensor parallelism, you can fit larger models, but you reduce throughput because of communication overhead. With minimal parallelism, you have higher throughput but can only fit smaller models. The sweet spot depends on your hardware, your cluster size, and your performance requirements.
Batching in tensor parallel systems requires careful coordination. When you batch multiple requests together, all requests must follow the same execution path through the parallelized layers. If one request finishes while others are still processing, you're wasting computation. Some systems implement variable-length batching where you can drop finished requests and add new ones mid-batch, but this is complex. Most systems use fixed-size batches, which simplifies the implementation but might leave GPUs idle waiting for the slowest request in the batch.
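The variable-length ("continuous") batching loop described above can be sketched in a few lines; the request dicts and field names here are illustrative, not any real serving API:

```python
from collections import deque

def continuous_batching_step(active, waiting, max_batch):
    """One scheduler tick: retire finished requests, admit waiting ones,
    then run a single decode step for every request in the batch."""
    active = [r for r in active if r["generated"] < r["max_tokens"]]
    while waiting and len(active) < max_batch:
        active.append(waiting.popleft())
    for r in active:
        r["generated"] += 1   # stand-in for one batched decode forward pass
    return active

# Five requests with different output lengths, batch capacity of 3
waiting = deque({"id": i, "generated": 0, "max_tokens": 3 + i} for i in range(5))
active = []
for _ in range(4):
    active = continuous_batching_step(active, waiting, max_batch=3)
# After 4 ticks, request 0 (3 tokens) has retired and request 3 took its slot
print(sorted(r["id"] for r in active))  # [1, 2, 3]
```

The fixed-size alternative would hold request 0's slot empty until the whole batch drained, which is exactly the idle time continuous batching reclaims.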
Memory efficiency is another critical concern. Each GPU in a tensor-parallel setup holds only its partition of the weights, but the input activations are replicated on every GPU. That replication is redundant but necessary for the communication pattern; keeping activations in a shared memory space instead would make computation impossibly slow. You're trading memory overhead for computation efficiency. The practical impact is that you need enough GPU memory to hold the model partition plus activation memory, and that activation memory grows with batch size.
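To see the batch-size scaling concretely, here is a very rough working-set estimate; the live_tensors constant is a hand-wavy assumption about how many full-size activation tensors are alive at once during inference:

```python
def activation_working_set_gb(batch, seq_len, hidden_dim=8192,
                              bytes_per_value=2, live_tensors=8):
    """Rough inference activation estimate: with no backward pass, only a
    handful of tensors are live at once, so the working set scales
    linearly with batch size (live_tensors=8 is an assumption)."""
    return batch * seq_len * hidden_dim * bytes_per_value * live_tensors / 1e9

for b in (1, 8, 32):
    print(b, round(activation_working_set_gb(b, 1024), 2))
# grows linearly: ~0.13 GB at batch=1, ~4.3 GB at batch=32
```

Small next to the weights, but it comes out of the same per-GPU budget the KV cache is competing for.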
Debugging multi-GPU inference is significantly harder than debugging single-GPU inference. When something goes wrong, you need to understand whether the issue is in the computation, the communication, or the coordination. Tools like NVIDIA Nsight Systems (nsys) can profile GPU execution, showing where time is spent and where there are communication bottlenecks. This level of visibility is essential for optimization.
Model compilation for tensor parallelism requires understanding the parallelization pattern. Some frameworks like vLLM handle much of this automatically, detecting the optimal parallelism strategy for your hardware and model. Other frameworks like DeepSpeed give you more control but require you to specify the parallelism strategy explicitly. The level of automation you want depends on how much control you need and how much engineering effort you want to spend.
Testing tensor parallel configurations requires simulating realistic load. You can't just run one request through the system; you need to test with multiple concurrent requests to see how batching, scheduling, and resource allocation work. You need to test with varying request sizes to see how the system handles heterogeneous workloads. You need to test failure scenarios to see how the system recovers when a GPU fails.
Monitoring production tensor parallel systems requires awareness of both individual GPU metrics and cluster-wide metrics. Individual GPU metrics like utilization and memory usage tell you how well resources are being used. Cluster-wide metrics like throughput and latency tell you if the system is meeting its performance goals. Correlation between these metrics helps identify optimization opportunities.
Cost optimization in tensor parallel inference centers on utilization. If GPUs are running at low utilization, you're wasting money. You might consolidate workloads so fewer GPUs are fully utilized rather than many GPUs at low utilization. Or you might increase batch size to improve utilization. Or you might use a less aggressive parallelism strategy and accept that you can serve smaller models. The right choice depends on your constraints and goals.
Putting It All Together: A Production Config
Let's design a configuration for serving Llama-3-70B at 100 requests/second with <50ms P99 latency.
# Production Inference Setup for Llama-3-70B
config = {
# Model & parallelism
"model": "Llama-3-70B",
"tp_degree": 4, # Balanced: 35GB per GPU, manageable comm
"pp_degree": 1, # No pipeline for inference (latency killer)
"expert_parallel": False, # Not an MoE model
# Hardware
"hardware": "8x A100 80GB on DGX A100",
"interconnect": "NVLink", # Full bisection bandwidth
# Inference strategy
"batch_strategy": "continuous_batching", # Mixture of prefill & decode
"max_batch_size_prefill": 32, # Larger batches for prefill
"max_batch_size_decode": 256, # Streaming decode is compute-light
# Expected performance
"first_token_latency_ms": 25, # P50 (prefill-bound)
"decode_latency_per_token_ms": 8, # P50 (memory-bound)
"throughput_tokens_per_sec": 2000, # At batch_size=256 decode
"p99_latency_ms": 35, # Includes queueing
}
# Actual serving implementation sketch
import torch
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Meta-Llama-3-70B",
tensor_parallel_size=4,
gpu_memory_utilization=0.85,
enforce_eager=False, # Allow CUDA graph capture (faster decode)
)
# Continuous batching: vLLM interleaves prefill (whole prompt at once)
# and decode (one token per step) across in-flight requests
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Requests queue managed by vLLM
requests = [
    # ... 100 requests/sec ...
]
outputs = llm.generate(requests, sampling_params)

Key Takeaways
- Tensor Parallelism trades memory for communication. TP=4 is often optimal for 70B models on modern hardware; larger TP degrees hit a communication ceiling.
- All-reduce latency dominates inference end-to-end latency. Even on NVLink, it adds 5-10ms per forward pass. On PCIe, it's 20-50ms. Size your TP accordingly.
- Pipeline Parallelism is almost never used for inference. The pipeline bubble makes it terrible for latency. Use it only in extreme cases (1T+ models) with careful micro-batching.
- Hardware topology is non-negotiable. NVLink enables TP=8; PCIe limits you to TP=2-4. If your infrastructure lacks NVLink, reduce TP and accept higher latency.
- Expert Parallelism in MoE adds all-to-all exchanges. These are cheaper than dense all-reduces but introduce load-balancing complexity. Mixtral-8x7B becomes attractive on smaller clusters precisely because expert routing is cheaper than dense tensor parallelism.
- Continuous batching hides latency through multiplexing. Prefill (first token) and decode (subsequent tokens) have different compute/memory ratios. Interleaving them keeps GPUs saturated.
Engineers who run LLM inference well use tensor parallelism wisely, sizing TP to their interconnect rather than their GPU count. They understand that communication is the real bottleneck and design their systems, and their parallelism strategy, around it.
Real-World Deployment Checklist
Before you commit to a TP configuration, validate these operational concerns:
Memory stability:
# Will KV cache fit as batch grows?
def check_kv_headroom(model_params, tp_degree, gpu_memory_gb):
    # FP16 weights, sharded across the TP group
    weights = model_params * 2 / tp_degree / 1e9
    # Worst case: batch=64, seq=4096
    # Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim=128, FP16
    # KV cache = 2 (K and V) * batch * seq * layers * kv_heads * head_dim * 2 bytes,
    # sharded across the TP group along the head dimension
    kv_cache = 2 * 64 * 4096 * 80 * 8 * 128 * 2 / tp_degree / 1e9
    activations = weights * 0.1  # rough estimate
    total = weights + kv_cache + activations
    headroom = gpu_memory_gb - total
    return headroom > 5  # Keep 5GB headroom for safety

# For Llama-3-70B on A100 80GB (the 40GB variant is a non-starter:
# weights alone are ~35GB per GPU at TP=4)
is_safe = check_kv_headroom(70e9, tp_degree=4, gpu_memory_gb=80)
print(f"TP=4 is safe: {is_safe}")  # True, with room to spare

Network topology validation:
# On a DGX, inspect the topology and measure real all-reduce bandwidth
nvidia-smi topo -m                                  # GPUs should connect via NVLink (NV#), not PCIe (PHB/PIX)
nccl-tests/build/all_reduce_perf -b 1M -e 256M -g 4 # Benchmark all-reduce across the TP group
# You should see ~100GB/s+ bus bandwidth on NVLink
# If <20GB/s, you have a cross-socket or PCIe bottleneck

Latency profiling:
# Instrument a single forward pass
import time
import torch

with torch.cuda.device(0):
    torch.cuda.synchronize()          # drain pending kernels before timing
    start = time.perf_counter()
    output = model.forward(tokens)    # model and tokens come from your serving setup
    torch.cuda.synchronize()          # wait for the pass to actually finish
    elapsed = time.perf_counter() - start
    print(f"E2E latency: {elapsed*1000:.1f}ms")
# Breakdown: prefill 15ms, attn layers 8ms, ffn layers 5ms, etc.

These checks prevent surprises in production: mysterious OOM errors, unexplained latency spikes, and load-skew-induced slowdowns.
The practical reality of deploying large models is that you need to be pragmatic about your constraints. Maybe you don't have enough GPUs to run your ideal configuration. Maybe you don't have network bandwidth for the communication your chosen strategy requires. Maybe you don't have engineers with the expertise to manage a sophisticated setup. Given your constraints, you find the best solution within those boundaries. This might mean running a smaller model, accepting lower throughput, or using a simpler parallelism strategy.
The migration path from single-GPU to multi-GPU inference is important. You don't want to redesign your entire system when you outgrow a single GPU. You want a path where you can grow incrementally. This suggests starting with a simple architecture, instrumenting it well, and understanding where the bottlenecks are before you add complexity. Sometimes the constraint isn't the model size or throughput, but the cost. In that case, you might optimize for cost-efficiency first, then worry about scaling to more GPUs later.
The relationship between batch size and parallelism is important. Larger batches provide better GPU utilization but longer per-request latency. If your users can't tolerate high latency, you need smaller batches. Smaller batches lower per-GPU throughput, so meeting the same aggregate demand may require more replicas or finer-grained parallelism. These constraints interact in complex ways. Understanding them helps you design systems that meet your performance requirements.
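A toy model makes the interaction visible. Decode is memory-bound, so a step's time is dominated by streaming the sharded weights once regardless of batch size; batching amortizes that cost. All constants below (35GB of weights per GPU, 2000 GB/s HBM bandwidth, 0.02ms per-sequence overhead) are illustrative assumptions, not measurements:

```python
def decode_step_ms(weights_gb_per_gpu, hbm_bw_gbs, batch, per_seq_overhead_ms=0.02):
    """Toy decode-step model: weight streaming time plus a small per-sequence cost."""
    weight_read_ms = weights_gb_per_gpu / hbm_bw_gbs * 1e3
    return weight_read_ms + batch * per_seq_overhead_ms

for batch in (1, 32, 256):
    step = decode_step_ms(35, 2000, batch)
    tok_per_sec = batch / step * 1e3
    print(f"batch={batch:4d}  step={step:6.2f}ms  throughput={tok_per_sec:8.0f} tok/s")
```

The step time barely moves as batch grows, but throughput scales almost linearly, which is exactly the tension: per-token latency for any one user degrades only slightly, while queueing delay to assemble the batch is what users actually feel.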
Upgrading existing production systems to add parallelism is complex. You can't just add new GPUs and expect things to work. The system needs to be designed for parallelism from the start, or you need a careful migration plan. Some teams run two parallel systems during the transition, gradually migrating traffic from the old system to the new one. This adds operational complexity but enables safe migration.
The future of inference parallelism likely involves more dynamic and adaptive strategies. Instead of fixing a parallelism strategy at deployment time, the system might adapt dynamically based on load and constraints. It might use coarse-grained parallelism when load is low and fine-grained parallelism when load is high. It might rebalance GPUs dynamically based on which requests are waiting. These dynamic systems are more complex but more adaptable.
The human factor in operating multi-GPU systems should not be underestimated. Engineers need to understand how tensor parallelism works, how to debug issues, and how to interpret performance metrics. Without this understanding, they'll make wrong decisions that hurt performance. Training and documentation are essential.
Finally, remember that parallelism is a tool to solve specific problems. If you don't have those problems, parallelism adds complexity without benefit. Start simple, measure, and add complexity only when measurement shows you need it. This principled approach leads to systems that are simple, understandable, and well-optimized for your actual constraints.
The performance characteristics of different parallelism strategies vary significantly with model architecture. For transformer-based models, tensor parallelism works well because matrix operations naturally decompose across GPUs. For models with different architectures like recurrent networks or mixture-of-experts, different parallelism strategies might be better. Understanding your specific model architecture helps guide the choice of parallelism strategy.
The training-versus-inference distinction is important because inference has different constraints. During training, you care about throughput. During inference, you care about both throughput and latency. This means the parallelism strategy that works best for training might not work for inference. Inference often requires smaller batches, which means you need fine-grained parallelism. Training can use larger batches, which means coarser-grained parallelism works.
The iterative nature of optimization is worth emphasizing. You don't get the parallelism strategy right on the first try. You implement something, measure it, identify bottlenecks, and optimize. Maybe you discover that communication is the bottleneck, so you use coarser-grained parallelism. Maybe you discover that computation is the bottleneck, so you add more parallelism. This iterative process leads to systems that are well-optimized for your specific constraints.
The documentation of your parallelism strategy is important for team continuity. What parallelism strategy did you choose? Why? What are the constraints and assumptions? How would you change it if those assumptions changed? When new engineers join, they need to understand these decisions. Well-documented decisions make the system more maintainable.
The relationship between parallelism and fault tolerance is worth considering. If a GPU fails in the middle of a multi-GPU inference, what happens? Do you restart the inference? Do you gracefully degrade? Do you have checkpoints to resume from? Some systems implement fault-tolerant inference where GPU failures are detected and recovered from. This is valuable for reliability but adds complexity.
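One simple recovery policy can be sketched as a wrapper: retry the multi-GPU path briefly after a failure (for example, a NCCL error raised when one TP rank dies), then degrade to a smaller single-GPU model rather than fail the request. This is a hypothetical policy sketch, not any particular framework's API:

```python
import time

def generate_with_fallback(primary, fallback, prompt, retries=1):
    """Try the multi-GPU path, back off and retry, then gracefully degrade."""
    for attempt in range(retries + 1):
        try:
            return primary(prompt)
        except RuntimeError:
            time.sleep(0.1 * (attempt + 1))  # back off before retrying
    return fallback(prompt)                   # graceful degradation path

def broken_cluster(prompt):
    # Stand-in for a TP forward pass whose communicator has died
    raise RuntimeError("NCCL communicator aborted")

print(generate_with_fallback(broken_cluster, lambda p: "degraded: " + p, "hello"))
# prints "degraded: hello"
```

Real systems layer health checks and traffic draining on top of this, but the core tradeoff is the same: retries buy availability at the cost of tail latency, and degraded answers beat errors for most products.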
The interaction between model serving and parallelism is important. Model serving systems like Triton or vLLM handle a lot of the parallelism complexity for you. They can automatically choose parallelism strategies based on your hardware and model. Using these tools can save you months of engineering effort compared to building parallelism from scratch. The tradeoff is less control over optimization.
The emerging techniques in inference parallelism are worth following. Speculative decoding uses one model to generate candidate tokens, then verifies them with another model in parallel. Continuous batching allows new requests to start processing before previous requests finish. Disaggregated prefill and decode separate these stages and run them on separate GPUs. These techniques provide order-of-magnitude improvements in throughput.
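To see why speculative decoding pays off, the standard acceptance model from the speculative decoding literature gives the expected tokens emitted per expensive target-model pass, assuming each of the γ drafted tokens is accepted independently with probability α (a simplification; real acceptance rates vary by position and prompt):

```python
def expected_tokens_per_pass(alpha, gamma):
    # Expected tokens emitted per target-model verification pass:
    # sum of a truncated geometric series, (1 - alpha^(gamma+1)) / (1 - alpha)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 drafted tokens, each target-model
# pass yields ~3.4 tokens instead of 1
print(round(expected_tokens_per_pass(0.8, 4), 2))
```

Since the draft model is cheap, the target model's cost amortizes over those tokens, which is where the speedup comes from.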
The relationship between data parallelism and tensor parallelism is worth understanding. Data parallelism is easier to implement but might not be sufficient if the model doesn't fit in a single GPU. Tensor parallelism fits larger models but introduces communication overhead. Sometimes you use both: data parallelism for handling multiple requests and tensor parallelism for handling large models. The combination requires careful orchestration.
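The DP-times-TP combination can be sized with a simple planner: pick the smallest TP degree that fits the sharded weights with room for KV cache, then replicate with data parallelism until the throughput target is met. A sketch under stated assumptions (FP16 weights, 40% of GPU memory reserved for KV cache and activations, measured per-replica throughput supplied by you):

```python
import math

def plan_cluster(model_gb, gpu_gb, target_tps, tps_per_replica):
    """Minimum GPUs: TP sized to fit the model, DP sized to hit throughput."""
    tp = 1
    while model_gb / tp > gpu_gb * 0.6:  # leave ~40% for KV cache + activations
        tp *= 2
    dp = math.ceil(target_tps / tps_per_replica)
    return {"tp": tp, "dp": dp, "gpus": tp * dp}

# 70B FP16 (~140 GB of weights) on 80 GB GPUs, 6000 tok/s target,
# assuming ~2000 tok/s per TP=4 replica
print(plan_cluster(140, 80, 6000, 2000))  # {'tp': 4, 'dp': 3, 'gpus': 12}
```

Note the orchestration point from the text: the TP groups need fast intra-group links (NVLink), while DP replicas only exchange requests, so they can live on separate nodes.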
The cloud infrastructure implications of multi-GPU inference are worth considering. Running inference on expensive GPU resources requires careful cost management. You might want to pause GPUs when not in use. You might want to share GPUs across models. You might want to burst to larger clusters during peak times. Cloud-native inference requires thinking about elasticity and cost.
The final principle is that parallelism is not the goal; performance and efficiency are the goal. Sometimes the simplest solution is the best. If you can serve your model fast enough on a single GPU, do that. Add parallelism only when it's necessary. This pragmatic approach leads to systems that are simple, understandable, and well-optimized.