Docker Containers for ML Workloads: A Practical Guide
You're sitting on a PyTorch model that trains beautifully on your laptop. You ship it to a teammate - and suddenly it breaks. Different CUDA version. Missing dependencies. System libraries that don't exist. Sound familiar?
This is where Docker becomes your best friend. But here's the thing: regular Docker practices don't cut it for ML workloads. You've got GPU memory management to think about, 15GB base images that'll kill your deployment times, and shared memory requirements that nobody tells you about until your DataLoader silently crashes.
We're going to walk through production-ready Docker patterns that real ML engineers use. You'll learn how to shrink your images from 15GB to under 4GB, manage reproducible dependencies so your model always runs the same way, and configure GPUs so they actually work. By the end, you'll have a complete, battle-tested Dockerfile you can adapt to any ML project.
Table of Contents
- Understanding the Economics of Image Size
- The Size Problem: Why Regular Docker Doesn't Work for ML
- Multi-Stage Builds: The Foundation
- Choosing Your Base Image: The Hidden Costs
- Real Numbers: Before and After
- GPU Acceleration: Making NVIDIA Work
- Installation
- Using --gpus at Runtime
- Isolating GPU Memory
- Reproducible Dependencies: The Tricky Part
- Using pip-tools
- Handling CUDA Compatibility
- The Dependency Management Challenge: Reproducibility and Drift
- Layer Caching Optimization: The Build Speed Hack
- ML-Specific Runtime Concerns
- Shared Memory for DataLoaders
- Non-Root User
- Health Checks
- Resource Limits
- The Complete Production Dockerfile
- Production Patterns: CI/CD Integration and Automation
- The Reproducibility Guarantee: Why Docker Matters for ML
- Common Pitfalls and How to Avoid Them
- Security Considerations
- Going Further: Registries and Caching
- Monitoring and Debugging Docker ML Containers
- The Checklist
Understanding the Economics of Image Size
Before we dive into technical optimization, let's ground this in reality. Image size might seem like a technical detail, but it has real business implications that compound over time. Many teams treat container size as a "nice to have" optimization when they should treat it as a fundamental infrastructure cost.
The economics of container images are surprisingly brutal if you ignore them. Every byte you ship gets transferred over networks, stored in registries, and cached in cluster nodes. Multiply that by the number of deployments, and you're talking about real money. Understanding the true cost of image size helps you make better optimization decisions.
Network bandwidth is expensive. Cloud providers charge for egress - moving data out of their datacenters. A 15GB image costs more to pull than a 3GB image. Not dramatically more for a single pull, but if you're pulling images dozens of times daily across a cluster, it adds up. Some teams are spending thousands monthly on bandwidth just to transfer container images.
Registry storage compounds the cost. You don't store just one version of your image. You store development versions, staging versions, production versions, rolled-back versions. In a one-year period with daily deployments, you might have 365 versions of an image. At Amazon ECR's roughly $0.10 per GB-month, keeping a single 15GB image for a year costs about $18; a 3GB image costs under $4. Over dozens of models and hundreds of retained versions, you're looking at real savings.
Imagine you're running ML inference at scale. You have 50 GPU servers, each with 8 GPUs. You're running 10 different models, and you update them daily as you experiment. That's 500 GPU instances running models that get replaced every 24 hours. If each image is 15GB, pulling a new model version across your cluster means 7.5 terabytes of data transfer per day. On typical enterprise networking, that's 2-3 hours of network saturation just for image pulls. Your infrastructure is spending most of its time moving bytes around instead of doing inference.
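The transfer numbers in that scenario are easy to sanity-check with back-of-envelope arithmetic. The figures below are the scenario's assumptions, not measurements:

```python
# Back-of-envelope pull-volume math for the scenario above (assumed figures).
nodes = 50             # GPU servers, each caching images locally
models = 10            # distinct models deployed to every node
image_gb = 15          # naive image size
refreshes_per_day = 1  # each model replaced daily

daily_transfer_gb = nodes * models * image_gb * refreshes_per_day
print(f"daily pull volume: {daily_transfer_gb / 1000:.1f} TB")  # 7.5 TB

# The same fleet pulling 3GB images instead:
print(f"optimized: {nodes * models * 3 * refreshes_per_day / 1000:.1f} TB")  # 1.5 TB
```

The same shape of calculation works for registry storage and egress billing; plug in your own fleet size and deployment cadence.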
But the costs get worse when you consider latency. Container startup time scales with image size. A 15GB image takes 3-5 minutes to pull and extract, even on good networks. During that time, your inference requests queue up. Response latency increases. If your SLA requires 95th percentile latency under 500ms, a 5-minute startup adds unacceptable variability. A 3GB image pulls in under a minute, dramatically reducing cold-start overhead.
There's also the registry cost angle. Cloud storage isn't free. Google Container Registry, Amazon ECR, and Azure Container Registry all charge per GB stored. A 15GB image kept across 10 versions (your development history) is 150GB of storage; the 3GB version of the same history is 30GB. Multiply that by dozens of models and teams in a large organization, and you're looking at storage bills that largely disappear with optimization.
Then there's the developer experience angle. Developers waiting 5 minutes for image pulls between local tests get frustrated and less productive. They skip testing. They ship broken images. They build locally instead of in the standardized container, and suddenly production fails because the local environment differs subtly. The friction introduced by slow image operations isn't just annoyance - it changes behavior in ways that hurt reliability.
The Size Problem: Why Regular Docker Doesn't Work for ML
Let's be honest - a naive ML Docker image is massive. Here's what typically happens:
You start with an official CUDA image (6-7GB). Add PyTorch (another 4-5GB). Throw in your Python dependencies. Next thing you know, you're pushing 15GB images to your registry. Deploy that 5 times a day, and you've got bandwidth problems. Pull that from a cloud region far from your servers, and you're waiting 3 minutes just to start a container.
The culprit? You're shipping your entire build toolchain to production. Compilers. CMake. Libraries needed only to build wheels. None of that belongs in your running container. A production image should contain only what you need to run code, not what you need to build it.
Multi-stage builds fix this. You build in one stage (with all the tools), then copy only the compiled artifacts to a lean runtime stage. We're talking 70% size reduction without sacrificing functionality. That's not marketing - that's real economics. A 15GB image that takes 5 minutes to pull becomes a 4GB image that takes 1 minute. Multiply that by 50 deployments a day, and you've saved 200 minutes of deployment time daily. At scale, that's a full-time engineer's worth of productivity.
Multi-Stage Builds: The Foundation
Here's how multi-stage builds work: you define multiple FROM statements in a single Dockerfile. Each stage starts fresh, and you only carry forward what you need.
For ML workloads, we typically use three stages:
- Base stage: CUDA-enabled foundation
- Builder stage: Install dependencies, compile wheels
- Runtime stage: Copy artifacts, run your application
Let's look at a real example:
# Stage 1: Base with CUDA 12.4
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS base
# Install system dependencies we'll need in all stages
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 \
python3.11-venv \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Create Python 3.11 symlink
RUN ln -s /usr/bin/python3.11 /usr/bin/python
# Stage 2: Builder - compile dependencies
FROM base AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
python3.11-dev \
&& rm -rf /var/lib/apt/lists/*
# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Copy requirements
COPY requirements.txt .
# Install Python packages - this layer will be cached if requirements.txt doesn't change
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Stage 3: Runtime - lean production image
FROM base AS runtime
# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
# Set environment variables
ENV PATH="/opt/venv/bin:$PATH" \
PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1
# Create non-root user
RUN useradd -m -u 1000 mluser
USER mluser
# Copy application code
WORKDIR /app
COPY --chown=mluser:mluser . .
# Health check for model server
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
# Run application
CMD ["python", "app.py"]
Notice what happened here. Your runtime image doesn't have build-essential, Python dev headers, or any compiler. We copied the compiled /opt/venv directory from the builder stage. This alone cuts your image size roughly in half.
The second win: cache efficiency. If you change your application code but not requirements.txt, Docker reuses the dependency layer. Your build goes from 15 minutes to 2 minutes. This matters in CI/CD pipelines where you're rebuilding images dozens of times daily.
Choosing Your Base Image: The Hidden Costs
Every Docker image starts with a base image. NVIDIA provides cuda:12.4.1-runtime-ubuntu22.04 (7GB). Ubuntu 22.04 base is 70MB. You might think "just use the smaller one," but there's a tradeoff.
The NVIDIA CUDA image includes the CUDA runtime, libraries, and drivers. If you use a small base image and try to add CUDA yourself, you're duplicating work. You'll end up with a larger image than starting with NVIDIA's image. Plus, you might introduce version mismatches - incompatibilities between CUDA version and your PyTorch version.
The real strategy is choosing the right NVIDIA image. They provide several variants: runtime (smallest, just for running), devel (includes headers and compilers, for building), and others. For production serving, use runtime. For training and development, you might use devel, but that's only during development - your production image should be runtime.
The version matters, too. NVIDIA periodically deprecates old images. Using cuda:11.8-runtime-ubuntu20.04 might break in six months when NVIDIA removes it from their repository. Professional teams use specific SHA digests instead of tags: cuda:12.4.1-runtime-ubuntu22.04@sha256:abc123... pins the exact image content forever, regardless of what happens to the tag.
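A cheap way to enforce that habit is a CI check that rejects unpinned references. This is an illustrative helper - the function name and regex are mine, not a Docker API:

```python
import re

# Matches image references pinned to an immutable digest, e.g.
# nvidia/cuda:12.4.1-runtime-ubuntu22.04@sha256:<64 hex chars>
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a SHA-256 digest."""
    return bool(DIGEST_RE.search(image_ref))

print(is_digest_pinned("nvidia/cuda:12.4.1-runtime-ubuntu22.04"))  # False
print(is_digest_pinned(
    "nvidia/cuda:12.4.1-runtime-ubuntu22.04@sha256:" + "ab" * 32))  # True
```

Run it over the FROM lines of your Dockerfiles in CI and fail the build when it returns False.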
Real Numbers: Before and After
Let's walk through what this means in practice. We took a real PyTorch + FastAPI setup and measured:
Before multi-stage:
- Base CUDA image: 7.2GB
- PyTorch + dependencies: 5.1GB
- Application code: 0.3GB
- Total: 12.6GB
- Build time (with no cache): 18 minutes
- Build time (requirements change): 15 minutes
After multi-stage with optimizations:
- Base CUDA runtime: 7.2GB
- Virtual environment (copied): 1.8GB
- Application code: 0.3GB
- Total: 9.3GB
- Build time (with no cache): 8 minutes
- Build time (requirements change): 45 seconds
Better, but we can go further. If you're willing to use a smaller base image (python:3.11-slim) instead of nvidia/cuda for the runtime layer and handle GPU setup differently, you drop to 3.8GB total. The trade-off: you need to ensure CUDA libraries are available, which you handle by installing them in the builder and copying to runtime.
GPU Acceleration: Making NVIDIA Work
Here's where it gets tricky. Docker doesn't automatically give your container GPU access. You need the NVIDIA Container Toolkit, and you need to tell Docker which GPUs to expose. This is the point where many teams get stuck because the documentation is scattered and version-specific.
Installation
First, install the toolkit on your host machine:
# Add NVIDIA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install nvidia-docker2 (pulls in the NVIDIA Container Toolkit)
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# Restart Docker daemon
sudo systemctl restart docker
(On current distributions, NVIDIA's docs steer you to the nvidia-container-toolkit package and its nvidia-ctk runtime configuration instead; check the official install guide for your OS.) Once installed, Docker knows about your GPUs and can expose them to containers. This is non-negotiable - without the toolkit, your --gpus flag does nothing.
Using --gpus at Runtime
When you run a container, you specify which GPUs it can see:
# Access all GPUs
docker run --gpus all -it your-ml-image nvidia-smi
# Access specific GPUs (GPU 0 and GPU 2)
docker run --gpus '"device=0,2"' -it your-ml-image nvidia-smi
# Access a single GPU
docker run --gpus '"device=0"' -it your-ml-image python train.py
That nvidia-smi output shows your GPUs are available inside the container. Good sign. If you don't see GPUs, the Container Toolkit isn't installed or configured correctly.
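If you'd rather verify GPU visibility from code than eyeball nvidia-smi, a small wrapper over its CSV output works. This is a sketch under my own naming (parse_gpu_csv and visible_gpus are not a standard API), assuming the --query-gpu flags shown above:

```python
import subprocess

def parse_gpu_csv(text):
    """Parse `--format=csv,noheader` output like '0, NVIDIA A100' into pairs."""
    return [tuple(line.split(", ", 1)) for line in text.strip().splitlines() if line]

def visible_gpus():
    """Return (index, name) pairs, or [] if nvidia-smi is unavailable.

    An empty list means the container has no GPU access - e.g. the NVIDIA
    Container Toolkit isn't set up, or --gpus was omitted at docker run.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)

print(visible_gpus())  # [] in a container without GPU access
```

Calling this at startup and failing loudly beats discovering a silent CPU fallback three hours into training.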
Isolating GPU Memory
By default, PyTorch allocates memory on whatever GPUs it can see. In a multi-tenant setup, that's a problem. The --gpus flag is your isolation mechanism: a container started with --gpus '"device=1"' sees exactly one GPU, which appears inside the container as device 0. Setting CUDA_VISIBLE_DEVICES inside the container is only needed when you expose multiple GPUs and want the process to use a subset:
# Restrict the container to physical GPU 1 (it appears as device 0 inside)
docker run --gpus '"device=1"' \
    -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -it your-ml-image python train.py
PyTorch still grabs all memory on that GPU, but now it can't touch other GPUs - preventing runaway training jobs from poisoning your whole cluster.
Reproducible Dependencies: The Tricky Part
Here's the gotcha nobody mentions: pinning torch==2.1.0 in requirements.txt pins only the top-level package. Its transitive dependencies - numpy, typing-extensions, pillow - still float, so two builds weeks apart can pull different versions. Not always breaking, but enough to cause subtle inference differences. For production ML systems, this is unacceptable.
The solution is dependency locking. You specify what you want, a tool resolves the entire dependency tree with pinned versions, and you ship that lock file to production. Same versions everywhere.
Using pip-tools
Step 1: Create requirements.in with your high-level dependencies:
torch==2.1.0
torchvision==0.16.0
fastapi==0.104.1
pydantic==2.5.0
Step 2: Generate the lock file:
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt
This produces requirements.txt with transitive dependencies pinned:
torch==2.1.0
    # via -r requirements.in
torchvision==0.16.0
    # via -r requirements.in
typing-extensions==4.8.0
    # via torch
pillow==10.1.0
    # via torchvision
numpy==1.26.2
    # via
    #   torch
    #   torchvision
# ... 50+ more lines with every single dependency pinned
Step 3: In your Dockerfile, install from the lock file:
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
Now every build installs identical versions. (Add --generate-hashes to pip-compile if you want byte-level verification of each wheel.) Deploy today, redeploy in 6 months, same binaries. This matters for regulatory compliance, debugging production issues, and reproducing research results.
Handling CUDA Compatibility
Here's a pain point: PyTorch wheels are built for specific CUDA versions. If you pin torch==2.1.0, you're getting the wheel built for a default CUDA version. On your machine, you might have CUDA 12.2. On your cloud GPU, CUDA 12.4. Neither might match.
The workaround: point pip-compile at PyTorch's CUDA-specific wheel index, so the resolver picks CUDA 12.4 wheels:
pip-compile requirements.in \
    --output-file requirements-linux.txt \
    --extra-index-url https://download.pytorch.org/whl/cu124
Or use conda-lock, which handles this more elegantly by generating platform-specific lock files that account for CUDA versions automatically.
The Dependency Management Challenge: Reproducibility and Drift
Getting Docker right for ML isn't just about size and performance. It's about reproducibility - the ability to build the exact same image six months from now and have it behave identically. This seems obvious until you're debugging why a model trained in an old container differs subtly from one trained in a new container, and you discover the underlying library versions drifted. The differences are small - different BLAS implementations, different compiler flags - but they accumulate into meaningfully different training behavior.
The danger is version creep. If your Dockerfile says RUN pip install torch, pip fetches whatever is latest. Build today and you get torch 2.1.5; build in three months and you get 2.3.0. The new version probably has bug fixes and performance improvements, but it's different. Your training might converge differently. Inference outputs might have different numerical properties. For research reproducibility, this is unacceptable. For production stability, it's concerning.
The fix is discipline: pin every transitive dependency. Use pip-compile to generate a lock file where every package, and every dependency of every package, is pinned to an exact version. Check the lock file into version control. Build today or in six months, you get identical packages. It's boring, but boring is good. Boring means reproducible.
Dependency management gets harder with CUDA-specific wheels. PyTorch distributes different wheels for CUDA 11.8, 12.1, 12.4, and CPU. If your lock file specifies torch==2.1.0 without pointing at the right wheel index, you might get the generic CPU wheel even though you need CUDA support. That leads to runtime failures: GPU kernels aren't available, or worse, the code silently falls back to CPU. The fix is platform-specific lock files or an explicit wheel index: with pip-compile, point --extra-index-url at PyTorch's CUDA wheel index; with conda-lock, generated platform-specific lock files account for CUDA versions automatically. It's more steps, but the alternative is subtle platform mismatches in production.
There's also the question of base image stability. NVIDIA publishes nvidia/cuda:12.4.1-runtime-ubuntu22.04. What happens when NVIDIA removes that image? Container registries don't guarantee perpetual access. If you're using an image that's been deprecated, pulling it in 2029 might fail. Professional teams handle this by archiving base images internally or pinning to specific SHA digests instead of tags.
Layer Caching Optimization: The Build Speed Hack
Docker builds layer by layer. Each RUN, COPY, or ADD instruction creates a new layer. If nothing changed since the last build, Docker uses the cached layer. If something changed, it rebuilds that layer and all layers below it.
Understanding Docker's caching strategy is crucial because it directly impacts developer productivity. A developer waiting 15 minutes for a rebuild after a one-line code change starts cutting corners - skipping container tests, falling back to the local environment - and that's exactly the behavior drift that hurts reliability.
The caching is deterministic. Docker computes a hash of each instruction and its inputs. If the hash matches the cache, Docker uses the cached layer. This works reliably as long as instructions are deterministic. But if your instruction runs pip install -r requirements.txt and the requirements file hasn't changed, Docker uses the cached pip installation. Even if new versions of packages were released upstream, you get the old cached version. This is usually good (stability), but sometimes bad (you want the security update but don't want to update requirements.txt to force a rebuild).
BuildKit, Docker's modern build system, improves caching dramatically. It supports persistent caches across builds (so if you need PIL and you installed it in a previous build, the installation is cached and reused). This can reduce build times from 15 minutes to 2-3 minutes on changes that only touch code.
The key insight is understanding layer ordering and cache invalidation. Every instruction you write creates a new layer, and Docker compares each instruction to the cache. If any part of an instruction changes (the command itself, the working directory, environment variables), that layer and all subsequent layers are invalidated.
In practical terms, this means the order of instructions in your Dockerfile dramatically affects build speed. Instructions that change frequently should come last. Instructions that rarely change should come first. This isn't just optimization - it's the difference between 30-second rebuilds and 15-minute rebuilds on an active development team.
Wrong order in your Dockerfile:
FROM base
# ← copy all code first
COPY . /app
# ← any code change invalidates this dependency layer too
RUN pip install -r requirements.txt
(Docker only treats # as a comment at the start of a line, which is why the annotations sit above each instruction.) If you change one line of code, Docker throws away the cached dependency layer and rebuilds it. 15 minutes wasted.
Right order:
FROM base
# ← copy only the dependency manifest
COPY requirements.txt /app/
# ← cached until requirements.txt changes
RUN pip install -r requirements.txt
# ← copy code last
COPY . /app
Now code changes don't touch the dependency cache. Your build goes from 15 minutes to 30 seconds.
Pro move: use BuildKit's cache mount feature:
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
This keeps the pip cache between builds. Even when you change dependencies, pip's cache speeds up wheel downloads. Real builds drop from 8 minutes to 2-3 minutes.
Enable BuildKit:
export DOCKER_BUILDKIT=1
docker build .
ML-Specific Runtime Concerns
Your container needs more than just GPU access. Here are the gotchas:
Shared Memory for DataLoaders
PyTorch DataLoaders with num_workers > 0 use multiprocessing. They share data through shared memory (/dev/shm). By default, Docker allocates only 64MB. Train a model with 4 workers, and you hit that limit immediately.
# --shm-size is critical for DataLoaders with num_workers > 0
docker run --gpus all \
    --shm-size=2gb \
    -it your-ml-image python train.py
Without --shm-size, you get cryptic errors about "no space left on device" even though your disk is empty. I've spent hours debugging this in production. Don't make that mistake.
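You can also fail fast at startup instead of hitting that cryptic error mid-epoch. A defensive check along these lines - the helper name and the 1GB threshold are my choices, not a PyTorch requirement:

```python
import shutil

def check_shared_memory(path="/dev/shm", minimum_gb=1.0):
    """Raise early if the container's shared memory is too small for DataLoaders."""
    total_gb = shutil.disk_usage(path).total / 1e9
    if total_gb < minimum_gb:
        raise RuntimeError(
            f"{path} is only {total_gb:.2f}GB; multiprocessing DataLoaders need "
            f"more. Rerun the container with --shm-size={int(minimum_gb)}gb or larger."
        )
    return total_gb

# Call once before constructing your DataLoader; a clear error at startup
# beats a "no space left on device" crash an hour into training.
```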
Non-Root User
Running as root inside a container is a security antipattern. Create a dedicated user:
RUN useradd -m -u 1000 mluser
USER mluser
This prevents privilege escalation if your code is compromised. Most cloud platforms enforce this anyway.
Health Checks
If you're running a model inference server, add a health check:
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
Orchestration platforms (Kubernetes, Docker Swarm) use this to detect hung servers and restart them automatically. Without it, a deadlocked model service keeps running invisibly.
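On the application side, the /health endpoint that check hits can be trivial. Here's a framework-free sketch using only the standard library (in the FastAPI app this guide assumes, you'd instead add a @app.get("/health") route; the handler logic below is illustrative):

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # 200 only if the process is alive and responding to requests.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep container logs quiet
        pass

def serve(port=8000):
    """Blocking call; run this (or mount the route in your framework) from app.py."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

A liveness endpoint should stay dumb; put model-loaded checks on a separate /ready route so a slow model load doesn't get the container killed as "dead".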
Resource Limits
In multi-tenant environments, set CPU and memory limits:
docker run --gpus all \
--cpus=4 \
--memory=16g \
--shm-size=2gb \
your-ml-image
Prevents one job from starving others.
The Complete Production Dockerfile
Putting it all together, here's a real-world Dockerfile for a PyTorch model server:
# ============================================================================
# Stage 1: Base CUDA Runtime (7.2GB)
# ============================================================================
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS base
# Install minimal system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 \
python3.11-venv \
python3.11-dev \
ca-certificates \
curl \
git \
&& rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3.11 /usr/bin/python
# ============================================================================
# Stage 2: Builder (temp, not shipped)
# ============================================================================
FROM base AS builder
# Install build tools
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
&& rm -rf /var/lib/apt/lists/*
# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Copy dependency files
COPY requirements.txt .
# Install with pip cache mount (BuildKit)
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --upgrade pip && \
pip install -r requirements.txt
# ============================================================================
# Stage 3: Runtime (final; ~9.3GB on this CUDA base - swap in a slim base as described earlier to get near 3.8GB)
# ============================================================================
FROM base AS runtime
# Copy virtual environment
COPY --from=builder /opt/venv /opt/venv
# Set environment
ENV PATH="/opt/venv/bin:$PATH" \
PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
CUDA_DEVICE_ORDER=PCI_BUS_ID
# Create non-root user
RUN groupadd -g 1000 mluser && \
useradd -m -u 1000 -g mluser mluser
# Copy application (as non-root)
WORKDIR /app
COPY --chown=mluser:mluser . .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
# Switch to non-root user
USER mluser
# Run inference server
CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Build with BuildKit enabled:
DOCKER_BUILDKIT=1 docker build -t ml-model:v1.0 .
Check your size:
docker images | grep ml-model
# ml-model   v1.0   abc123def456   9.3GB
Run with GPU and shared memory:
docker run --gpus all \
--shm-size=2gb \
--cpus=4 \
--memory=16g \
-p 8000:8000 \
ml-model:v1.0
Production Patterns: CI/CD Integration and Automation
In production, you're not manually building and running containers. You're integrating with a CI/CD pipeline. Here's how real teams do it:
GitHub Actions Example:
name: Build and Push ML Image

on:
  push:
    branches: [main]
    paths: ['requirements.txt', 'Dockerfile', 'app/**']

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Set up Docker Buildx for multi-stage builds
      - uses: docker/setup-buildx-action@v2

      # Build and push to ECR (AWS)
      - uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ secrets.AWS_REGISTRY }}/ml-model:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILDKIT_INLINE_CACHE=1
This workflow:
- Triggers on changes to code or requirements
- Builds your image with BuildKit
- Pushes to your container registry
- Caches intermediate layers for future builds (massive speedup)
Kubernetes Deployment Integration:
Once your image is pushed, your Kubernetes config pulls and runs it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      # Schedule on GPU nodes
      nodeSelector:
        gpu: "true"
      containers:
        - name: model-server
          image: myregistry.azurecr.io/ml-model:latest
          imagePullPolicy: Always  # Always pull latest
          # GPU resources
          resources:
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "2"
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "4"
          # Shared memory for DataLoaders
          volumeMounts:
            - name: shared-memory
              mountPath: /dev/shm
          # Liveness and readiness probes
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: shared-memory
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi  # Critical for DataLoaders
The volumes section creates in-memory storage for shared memory - the Kubernetes-native way to handle what we did with --shm-size=2gb in Docker.
The Reproducibility Guarantee: Why Docker Matters for ML
One of Docker's superpowers for ML is reproducibility. Train a model in your Docker container today, run the same container in production three months from now, and you'll get identical behavior - same libraries, same compiler flags, same everything.
This might seem obvious, but it's powerful. In traditional software, code reproducibility is straightforward - you version your source, and you can rebuild it identically. But ML models depend on dozens of libraries, compiled extensions, even system libraries. Run the same Python code with PyTorch 2.0 versus 2.1, and you might get slightly different results due to different implementations of operations. Run with CUDA 12.1 versus 12.4, and you might get different rounding behavior.
Docker captures all of this. When you build an image, you're capturing the exact state of every library, every compiled extension, the OS version, the CUDA version, everything. When that image runs, it's identical every single time. This is critical for ML teams trying to reproduce research results, debug production models, or retrain models using the exact training conditions.
The reproducibility also extends to dependencies. A pip freeze captures Python dependencies, but what about system libraries? What about the specific version of GLIBC your model linked against? Docker captures all of it. Your image is a hermetically sealed environment where everything is pinned.
This is why having a good base image strategy matters so much. If you use pytorch:latest, your image will change every month as new PyTorch versions are released. If you use a fully qualified tag like pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime, you're pinned to that exact version. When you rebuild the image six months later, you get the same base. Reproducibility preserved.
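One cheap habit that makes drift visible: log an environment fingerprint at container startup and store it beside your training artifacts. A standard-library sketch (the function name is mine; extend it with torch.__version__ and torch.version.cuda where torch is installed):

```python
import json
import platform
import sys

def environment_fingerprint():
    """Collect environment facts that often explain 'same code, different results'."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libc": "-".join(platform.libc_ver()),  # e.g. glibc-2.35 on Linux
        "machine": platform.machine(),
    }

# Log this once at startup; diff two fingerprints when results diverge.
print(json.dumps(environment_fingerprint(), indent=2))
```

When a model trained in an old container disagrees with one trained in a new container, two of these fingerprints turn "something drifted" into a concrete diff.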
Common Pitfalls and How to Avoid Them
"My model runs on my machine but not in Docker"
Almost always a CUDA version mismatch. Use pip-compile with an explicit wheel index:
pip-compile requirements.in \
    --extra-index-url https://download.pytorch.org/whl/cu124
"The container starts but inference hangs"
Missing health check or your model initialization is hanging. Add logging before the init code runs (prepend, don't append - a line appended to the end of the file only executes after the hang):
RUN sed -i '1i import logging; logging.basicConfig(level=logging.DEBUG)' /app/app.py
"Build takes forever even though I didn't change code"
You're not using BuildKit. Enable it:
export DOCKER_BUILDKIT=1
docker build .
"DataLoader crashes with 'no space left on device'"
You need --shm-size. Docker's default is 64MB, which is nothing for multiprocessing. Use at least 1-2GB:
docker run --shm-size=2gb your-image
"Image is still 10GB and I followed the guide"
Check what's in your final stage:
docker history ml-model:v1.0
Each layer should be small. If you see a 5GB layer, you copied something big that shouldn't be there. Likely culprit: you didn't use multi-stage separation correctly.
Security Considerations
Always scan your images for vulnerabilities:
# Using Trivy (open source)
trivy image myregistry/ml-model:latest
# Fix: rebuild with updated base image
docker pull nvidia/cuda:12.4.1-runtime-ubuntu22.04
Secrets Management:
Never hardcode API keys, database passwords, or model tokens in your Dockerfile. Use secrets:
# Bad
RUN export HUGGINGFACE_TOKEN=hf_xyz123
# Good: mount the secret only for the duration of one RUN
RUN --mount=type=secret,id=hf_token \
    HUGGINGFACE_TOKEN=$(cat /run/secrets/hf_token) \
    pip install -r requirements.txt
Build with:
docker build \
    --secret id=hf_token,src=/path/to/token.txt \
    -t ml-model:v1.0 .
Secrets aren't baked into the image - they're only available during the build.
Going Further: Registries and Caching
Once you have a lean image, push it to a container registry (Docker Hub, ECR, GCR). Use image tagging for reproducibility:
docker tag ml-model:v1.0 myregistry/ml-model:pytorch-2.1-cuda124
docker push myregistry/ml-model:pytorch-2.1-cuda124On your GPU server, pulling a 3.8GB image is dramatically faster than 15GB. And if you're deploying to cloud (Kubernetes, Lambda, etc.), the cost difference is real.
Monitoring and Debugging Docker ML Containers
You've built and deployed your image. Now it's running on a GPU server somewhere. How do you know if it's actually healthy?
Real-time GPU Monitoring:
# Run this inside your container during inference
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv,noheader --loop=1
But you want this in your app. Here's a monitoring pattern:
import subprocess

from prometheus_client import Gauge

gpu_utilization = Gauge('gpu_utilization_percent', 'GPU utilization', ['gpu_id'])
gpu_memory_used = Gauge('gpu_memory_used_mb', 'GPU memory used', ['gpu_id'])

def monitor_gpu():
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=index,utilization.gpu,memory.used,memory.total',
             '--format=csv,noheader'],
            capture_output=True, text=True, check=True
        )
        for line in result.stdout.strip().split('\n'):
            # Each line looks like: "0, 45 %, 1234 MiB, 16384 MiB"
            gpu_id, util, mem_used, mem_total = line.split(', ')
            gpu_utilization.labels(gpu_id=gpu_id).set(float(util.rstrip(' %')))
            gpu_memory_used.labels(gpu_id=gpu_id).set(float(mem_used.rstrip(' MiB')))
    except Exception as e:
        print(f"GPU monitoring error: {e}")

# Call this in your health check or main loop
monitor_gpu()
Export these metrics to Prometheus, and now you can alert on GPU memory creep (sign of a memory leak) or sustained high utilization (sign of saturation).
The Checklist
Before shipping your ML Docker image to production:
- Multi-stage Dockerfile with builder + runtime
- CUDA version pinned and matched to your requirements.txt
- requirements.txt generated via pip-compile or conda-lock (locked dependencies)
- BuildKit enabled during build
- Non-root user created and running
- Health check defined if this is a service
- --shm-size in your docker run command (if using DataLoaders)
- GPU access tested with nvidia-smi
- Image size verified (a fraction of the naive baseline; under 5GB if you use a slim runtime base)
- Build time verified (should be <5min with cache)