November 24, 2025
AI/ML Infrastructure MLOps

Version Control for ML: Code, Data, and Models

You've built a great machine learning model. It hits 94% accuracy on your test set. You're thrilled. Three weeks later, someone asks: "Which dataset trained that model? What hyperparameters did you use? Can we reproduce it?" You dig through Jupyter notebooks, old Slack messages, and half-forgotten experiment logs. This is the problem that kills ML projects - not the models themselves, but the inability to answer when you trained something, what data went into it, and why it performed the way it did.

Here's the painful truth: Git alone is insufficient for machine learning. Git was built for source code, which is text-heavy and stable. Your ML pipeline deals with gigabyte-sized datasets, binary model files, hyperparameter variations, and metrics that change with every experiment. Throwing a 500 MB model file into Git creates bloated repositories and ruins performance for the entire team. You need a system that versions code, data, and models as a unified experiment snapshot - and that's exactly what we're building in this guide.

Table of Contents
  1. Why Git Alone Falls Short for ML
  2. The Unified Versioning Stack: Git + DVC + MLflow
  3. Part 1: Setting Up DVC for Data and Model Versioning
  4. Step 1: Initialize DVC
  5. Step 2: Configure Remote Storage
  6. Step 3: Version Your Dataset
  7. Step 4: Define a DVC Pipeline
  8. Part 2: Experiment Tracking with MLflow
  9. Step 1: Initialize MLflow Tracking
  10. Step 2: View Experiment Results
  11. Step 3: Connect DVC to MLflow Logging
  12. Part 3: Model Registry and Promotion Workflow
  13. Step 1: Register a Model
  14. Step 2: Check Registry Status
  15. Step 3: Promote to Staging
  16. Step 4: Production Promotion Workflow
  17. Common Pitfalls in ML Version Control
  18. The "Accidental Large File" Disaster
  19. The "dvc.lock Merge Conflict" Nightmare
  20. The "Orphaned Experiments" Problem
  21. The "Data Drift Without Lineage" Trap
  22. Part 4: Versioning Conventions That Scale
  23. Dataset Versioning Convention
  24. Model Versioning: Semantic Versioning
  25. Deprecation Policies
  26. Production Considerations: From Dev to Serving
  27. Immutable Model Registry Across Environments
  28. Rollback Safety: Always Keep N-1 in Production
  29. Drift Detection from Version Control
  30. Putting It All Together: An End-to-End Example
  31. The Adoption Curve: From Chaos to Maturity
  32. The Hidden Costs of ML Version Control
  33. Key Takeaways
  34. Organizational Challenges in ML Version Control: Real-World Complexity
  35. Scaling Version Control Beyond Single Teams
  36. Long-term Maintenance and Cleanup
  37. Sources

Why Git Alone Falls Short for ML

Let's be concrete about Git's limitations when applied to machine learning projects.

Binary File Bloat: Git stores every version of every file you commit. When you commit a 200 MB neural network model, then change one hyperparameter and train again, Git now stores 400 MB. Commit ten variations, and your .git directory becomes a multi-gigabyte nightmare. Cloning the repo takes forever. Pushing to remote storage is painful.

No Experiment Context: Git tracks code changes through diffs and commit messages. But it doesn't know that you trained a model with learning rate 0.001, batch size 32, and Adam optimizer. Those details live in scattered configuration files, notebook cells, or your brain. Git versions the code, not the experimental choices.

Dataset Evolution: Your training data isn't static. You clean it, add new samples, fix labels. With Git, you'd version the entire CSV or Parquet file, which defeats the purpose. You need data lineage - knowing which specific dataset (with its version hash) produced which model.

Disconnected Metadata: Git commits happen when you push code. But you train models asynchronously, on different machines, sometimes with GPU clusters. The Git commit hash and your model training aren't naturally linked. You manually have to stitch this together.

The pattern is consistent: teams that rely on Git alone for ML versioning spend far longer debugging failed experiments than teams using specialized tooling. We're going to fix that.

The Unified Versioning Stack: Git + DVC + MLflow

We solve this with a three-layer architecture:

  • Git versioning your code, configs, and small reference files
  • DVC (Data Version Control) tracking datasets and models with remote storage
  • MLflow logging experiments, metrics, and model registry metadata

These three tools work together seamlessly. Here's the conceptual model:

mermaid
graph TB
    subgraph Git["Git Repository"]
        Code["Code (.py)"]
        Config["Config (params.yaml)"]
        DVCRefs["DVC Refs (.dvc files)"]
    end
 
    subgraph DVC["DVC & Remote Storage"]
        DataCache["Data Cache"]
        ModelCache["Model Cache"]
        S3["S3/GCS Remote"]
    end
 
    subgraph MLflow["MLflow"]
        Experiments["Experiments & Runs"]
        Registry["Model Registry"]
        Artifacts["Artifacts & Logs"]
    end
 
    Git -->|dvc.lock| DVC
    DVC -->|push/pull| S3
    Code -->|triggers| Experiments
    Experiments -->|logs to| MLflow
    MLflow -->|registers| Registry
    Registry -->|links to| Git

Here's what this means in practice: when you commit code changes, you're also committing .dvc files that point to specific versions of datasets and models stored in S3. When you train an experiment, MLflow automatically logs the Git commit hash, the dataset hash from DVC, and all your metrics in one place. Three months later, you can reproduce the exact training run - code version, data version, and model version - with a single command.
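To make that "single command" reproduction concrete, a snapshot can be rebuilt from any commit hash with three standard Git/DVC commands. The helper below just assembles them as a sketch; the commands themselves are real:

```python
def reproduce_commands(commit: str) -> list[str]:
    """Commands that rebuild an experiment snapshot from a Git commit:
    check out the code and .dvc metadata, sync data/models to the hashes
    recorded in dvc.lock, then re-run the pipeline to verify outputs."""
    return [
        f"git checkout {commit}",  # code, params.yaml, dvc.lock
        "dvc checkout",            # restore data/models matching dvc.lock (dvc pull if not cached)
        "dvc repro",               # verify the pipeline reproduces the recorded hashes
    ]
```

If `dvc repro` reports all stages as unchanged afterward, the reproduction matched the recorded hashes exactly.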

Part 1: Setting Up DVC for Data and Model Versioning

Let's get hands-on. We'll set up DVC with S3 as remote storage (substitute GCS or Azure if you prefer).

Step 1: Initialize DVC

bash
pip install dvc dvc-s3 mlflow
cd your-ml-project
git init  # if not already a git repo
dvc init

What happens: DVC creates a .dvc/ directory in your project (similar to .git/). A .dvc/config file appears where you'll configure remote storage.

Step 2: Configure Remote Storage

bash
dvc remote add -d myremote s3://my-ml-bucket/dvc-store
dvc remote modify myremote profile myawsprofile  # or use default AWS credentials
dvc remote list  # verify it's set

Expected output:

myremote	s3://my-ml-bucket/dvc-store (default)

Now DVC knows where to push/pull large files. Commit this configuration:

bash
git add .dvc/config
git commit -m "Configure DVC remote storage"

Step 3: Version Your Dataset

Your training data lives in data/raw/customer_data.csv (let's say it's 300 MB).

bash
dvc add data/raw/customer_data.csv

What happens: DVC computes a hash of the file, stores the actual data locally in .dvc/cache/, and creates data/raw/customer_data.csv.dvc (a small ~100 byte metadata file). The .dvc file contains the hash and remote location.

bash
cat data/raw/customer_data.csv.dvc

Output:

yaml
outs:
  - md5: a1b2c3d4e5f6a1b2c3d4e5f6
    size: 314572800
    hash: md5
    path: customer_data.csv
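That md5 field is an ordinary content hash over the file's bytes. A rough sketch of what DVC computes per file (the real cache additionally shards files into directories by hash prefix, which this sketch ignores):

```python
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, streamed in chunks so large datasets
    never need to fit in memory - the basis of content addressing."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

Because the hash depends only on content, two identical files dedupe to one cache entry, and any edit produces a new hash.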

Add this to Git:

bash
git add data/raw/customer_data.csv.dvc
git add .gitignore  # DVC auto-adds the .csv to .gitignore
git commit -m "Version raw customer dataset"

Push the actual data to remote storage:

bash
dvc push

Now the 300 MB file is in S3, and your Git repo is lightweight. Your teammate can clone the repo and fetch the data:

bash
dvc pull  # downloads the 300 MB file based on the .dvc metadata

Step 4: Define a DVC Pipeline

Create a dvc.yaml file that defines your training workflow:

yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/customer_data.csv
      - src/prepare.py
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
    params:
      - prepare.train_split
 
  train:
    cmd: python src/train.py
    deps:
      - data/processed/train.csv
      - src/train.py
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics/metrics.json:
          cache: false

And your params.yaml:

yaml
prepare:
  train_split: 0.8
 
train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 32

Now run the entire pipeline:

bash
dvc repro

What DVC does: It checks which files have changed since the last run. It executes only the affected stages (in order: prepare → train). It records the exact output hashes in dvc.lock.
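A toy model of that change detection helps build intuition. Real DVC compares hashes recorded in dvc.lock, handles params and directories, and builds a full DAG; this sketch assumes stages are listed in topological order:

```python
def stages_to_run(stages, lock, current_hash):
    """A stage re-runs if any dependency changed relative to the lock
    file, and changes propagate downstream through stage outputs."""
    changed = set()   # paths whose content is (or is about to be) new
    to_run = []
    for name, spec in stages.items():
        dirty = any(
            dep in changed
            or lock.get(name, {}).get('deps', {}).get(dep) != current_hash[dep]
            for dep in spec['deps']
        )
        if dirty:
            to_run.append(name)
            changed.update(spec['outs'])  # downstream stages see new inputs
    return to_run
```

With unchanged hashes everywhere, nothing runs; touch the raw data and both prepare and train re-execute.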

dvc.lock (auto-generated):

yaml
prepare:
  cmd: python src/prepare.py
  deps:
    - path: data/raw/customer_data.csv
      hash: md5
      md5: a1b2c3d4e5f6
  outs:
    - path: data/processed/train.csv
      hash: md5
      md5: f9e8d7c6b5a4
train:
  cmd: python src/train.py
  deps:
    - path: data/processed/train.csv
      hash: md5
      md5: f9e8d7c6b5a4
  outs:
    - path: models/model.pkl
      hash: md5
      md5: c3b2a1d0e9f8
  metrics:
    - metrics/metrics.json:
        hash: md5
        md5: abc123def456

This file is sacred - it links code, data, and model versions in one place. Commit it:

bash
git add dvc.yaml params.yaml dvc.lock
git commit -m "Define DVC pipeline: data prep and training"

Push models to DVC remote:

bash
dvc push

Now your Git repo contains all the information (code, configs, metadata), but the artifacts (datasets, models) live safely in S3. Perfect for team collaboration.

Part 2: Experiment Tracking with MLflow

DVC handles reproducibility (which exact code + data + model?). MLflow handles experimentation (what were the metrics for each run?). Let's connect them.

Step 1: Initialize MLflow Tracking

Create a simple training script that logs to MLflow:

python
# src/train_with_mlflow.py
import mlflow
import mlflow.sklearn
import json
import yaml
import os
import subprocess
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
 
# Get the current Git commit hash
git_commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('utf-8').strip()
 
# Load params
with open('params.yaml') as f:
    params = yaml.safe_load(f)
 
# Load data (read each CSV once, then split features/target)
import pandas as pd
train_df = pd.read_csv('data/processed/train.csv')
test_df = pd.read_csv('data/processed/test.csv')

X_train, y_train = train_df.drop('target', axis=1), train_df['target']
X_test, y_test = test_df.drop('target', axis=1), test_df['target']
 
# Start MLflow run
mlflow.set_experiment("customer-churn-prediction")
 
with mlflow.start_run(description="Random Forest with default params"):
    # Log code context
    mlflow.log_param("git_commit", git_commit)
    mlflow.log_param("git_branch", subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode('utf-8').strip())
 
    # Log hyperparameters from params.yaml
    mlflow.log_params(params['train'])
 
    # Train model
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=params['train'].get('max_depth', 10),
        random_state=42
    )
    model.fit(X_train, y_train)
 
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
 
    # Log metrics
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("test_f1", f1)
 
    # Log model artifact
    mlflow.sklearn.log_model(model, "model", registered_model_name="customer-churn-rf")
 
    # Log metrics file
    metrics = {"accuracy": accuracy, "f1": f1}
    mlflow.log_dict(metrics, "metrics/final_metrics.json")
 
    print(f"Run logged. Accuracy: {accuracy:.4f}, F1: {f1:.4f}")

Run it:

bash
python src/train_with_mlflow.py

What happens: MLflow creates a "run" with a unique ID, logs your parameters and metrics, and records the Git commit hash the script captured. All stored in the local mlruns/ directory by default.

Step 2: View Experiment Results

bash
mlflow ui

Open http://localhost:5000 in your browser. You'll see a table of runs with parameters, metrics, and registered models. Every run is linked to the Git commit that triggered it.

Step 3: Connect DVC to MLflow Logging

Modify your training script to also log DVC's data hash:

python
# In train_with_mlflow.py, add (yaml is already imported at the top):

# Load dvc.lock to get data and model hashes
with open('dvc.lock') as f:
    dvc_lock = yaml.safe_load(f)

train_data_hash = dvc_lock['prepare']['outs'][0]['md5']
model_hash = dvc_lock['train']['outs'][0]['md5']
 
# In the MLflow run, log these:
mlflow.log_param("dvc_train_data_hash", train_data_hash)
mlflow.log_param("dvc_model_hash", model_hash)

Now MLflow knows:

  • Which Git commit (code version)
  • Which DVC dataset hash (data version)
  • Which DVC model hash (model version)
  • All the metrics and parameters

This is full lineage.
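One lightweight way to carry that lineage around in application code is a small immutable record. A sketch - the class and field names are ours, not an MLflow or DVC schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageRecord:
    """Full lineage of one experiment: code, data, and model versions."""
    git_commit: str
    dvc_data_hash: str
    dvc_model_hash: str

    def short_id(self) -> str:
        # Compact identifier for logs and dashboards
        return f"{self.git_commit[:8]}-{self.dvc_data_hash[:8]}-{self.dvc_model_hash[:8]}"
```

Attach `short_id()` to predictions or log lines and any production output traces back to an exact experiment.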

Part 3: Model Registry and Promotion Workflow

MLflow's Model Registry is your central approval system for moving models from development to production.

Step 1: Register a Model

In the training script above, we passed registered_model_name="customer-churn-rf". That automatically registers the model in the MLflow registry.

Step 2: Check Registry Status

python
from mlflow.tracking import MlflowClient

client = MlflowClient()
for model in client.search_registered_models():
    print(model.name)

Or view in the UI under the "Models" tab.

Step 3: Promote to Staging

When a model passes evaluation, promote it:

Find your best run in the MLflow UI (or with MlflowClient.search_runs), note which registered model version it produced, then transition that version:

python
from mlflow.tracking import MlflowClient
 
client = MlflowClient()
 
# Transition a model version to "Staging"
client.transition_model_version_stage(
    name="customer-churn-rf",
    version=1,
    stage="Staging"
)

Step 4: Production Promotion Workflow

mermaid
graph TB
    Dev["Development<br/>(Latest experimental run)"]
    Staging["Staging<br/>(Candidate for prod)<br/>- Validate on holdout data<br/>- A/B test if needed"]
    Prod["Production<br/>(Served to users)<br/>- Monitor metrics<br/>- Define rollback triggers"]
    Archived["Archived<br/>(Old versions)"]
 
    Dev -->|promote if metrics OK| Staging
    Staging -->|promote after validation| Prod
    Prod -->|rollback if degradation| Staging
    Prod -->|retire when obsolete| Archived
    Staging -->|reject if fails tests| Dev

In practice:

python
# Promote to Production when you're confident
client.transition_model_version_stage(
    name="customer-churn-rf",
    version=1,
    stage="Production"
)
 
# Rollback if metrics degrade
client.transition_model_version_stage(
    name="customer-churn-rf",
    version=1,
    stage="Staging"
)
 
# Archive old versions
client.transition_model_version_stage(
    name="customer-churn-rf",
    version=0,
    stage="Archived"
)

Your serving code checks the registry:

python
# Load production model
model_uri = "models:/customer-churn-rf/Production"
model = mlflow.sklearn.load_model(model_uri)
predictions = model.predict(new_data)

Common Pitfalls in ML Version Control

Before we get into conventions, let me be direct: there are easy ways to mess this up. Teams implement 90% of Git + DVC + MLflow, miss one thing, and spend weeks in version control hell.

The "Accidental Large File" Disaster

You run git add . to commit your code changes. Then you realize you've accidentally committed a 500 MB model file. Now your Git repo is bloated forever. The file is in the history; deleting it from HEAD doesn't remove it from earlier commits. You have to rewrite history with git filter-branch or git-filter-repo, which breaks everyone's local clones.

The fix: add a .gitignore before anyone commits anything:

gitignore
# .gitignore - protect your repo from bloat
*.pkl
*.h5
*.pt
*.pth
*.onnx
*.pb
*.joblib
*.msgpack
 
# Data
data/
*.csv
*.parquet
*.parquet.gzip
*.feather
 
# Large artifacts
models/
artifacts/
outputs/
.dvc/cache/
 
# IDE and OS
.DS_Store
.vscode/
*.swp
__pycache__/
*.egg-info/
.env

And immediately after git init, add DVC:

bash
git init
git add .gitignore
git commit -m "Initial commit with .gitignore"
dvc init
git add .dvc/ .gitignore
git commit -m "Initialize DVC"

This prevents the problem before it starts.
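You can also enforce this mechanically. A minimal pre-commit check that flags staged files too big for Git - the 10 MB threshold and the hook wiring are up to you:

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # anything bigger belongs in DVC, not Git

def oversized_files(paths, max_bytes=MAX_BYTES):
    """Return files that should be blocked from a Git commit.
    Feed it the output of `git diff --cached --name-only` in a pre-commit hook."""
    return [
        p for p in paths
        if os.path.isfile(p) and os.path.getsize(p) > max_bytes
    ]
```

If the returned list is non-empty, the hook exits non-zero and prints the offenders, pointing the committer at `dvc add` instead.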

The "dvc.lock Merge Conflict" Nightmare

Two data scientists work on different branches. They both run dvc repro, which generates dvc.lock. They merge their branches into main. Git can't merge dvc.lock because both versions changed multiple lines. You get a merge conflict with hundreds of lines.

The rule: treat dvc.lock as generated output. A designated team member reviews and merges dvc.lock changes by hand; never let automatic merge tools touch it.

In your Git workflow:

bash
# You modify params.yaml and run dvc repro
dvc repro
git add params.yaml dvc.lock src/train.py
git commit -m "Tune learning rate"
git push
 
# Data person reviews your changes
# They verify dvc.lock makes sense before merging
# They handle the merge manually if needed

For larger teams, treat a conflicted dvc.lock as disposable and regenerate it from the merged code and params:

bash
# After resolving conflicts in code and params.yaml, rebuild dvc.lock
git checkout --ours dvc.lock   # take either side as a starting point
dvc repro                      # re-runs the pipeline and rewrites dvc.lock
git add dvc.lock

The "Orphaned Experiments" Problem

You run 50 experiments in MLflow, but only commit 3 to Git. Six months later, you forget which Git commit corresponds to which MLflow run. The metadata becomes useless.

The discipline: every MLflow experiment must be backed by a Git commit. Before you log to MLflow, you've already committed your code and params:

bash
# 1. Modify params
vim params.yaml
 
# 2. Commit to Git
git add params.yaml
git commit -m "Experiment: increase learning rate"
 
# 3. Run pipeline
dvc repro
 
# 4. Log to MLflow (the commit hash is already in Git)
python src/train_with_mlflow.py

This way, your MLflow run always knows which Git commit it came from (logged automatically).
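To enforce the discipline rather than rely on memory, the training script can refuse to run against a dirty working tree. A sketch - the check accepts a pre-fetched status string so it's testable without a repo:

```python
import subprocess

def assert_clean_worktree(status=None):
    """Raise unless `git status --porcelain` is empty, so every MLflow
    run maps back to exactly one Git commit."""
    if status is None:
        status = subprocess.check_output(
            ['git', 'status', '--porcelain']
        ).decode('utf-8')
    if status.strip():
        raise RuntimeError(
            "Uncommitted changes detected; commit before logging to MLflow:\n" + status
        )
```

Call it at the top of train_with_mlflow.py and orphaned experiments become impossible by construction.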

The "Data Drift Without Lineage" Trap

You train a model on dataset v1. Three months later, someone updates the dataset (fixes bugs, adds new samples). Your model still references the old dataset through DVC, but you don't notice the lineage is broken. You retrain on v2 and deploy. It fails in production because the data distribution changed.

The fix: add explicit lineage checks in your training script:

python
import dvc.api
import yaml
 
def verify_data_lineage():
    """Check that training data matches expected version"""
 
    # Load the commit's dvc.lock
    with open('dvc.lock') as f:
        dvc_lock = yaml.safe_load(f)
 
    train_data_hash = dvc_lock['prepare']['deps'][0]['md5']
 
    # Compare against expected (from Git tag or config)
    with open('expected_data_versions.yaml') as f:
        expected = yaml.safe_load(f)
 
    if train_data_hash not in expected['approved_versions']:
        raise ValueError(
            f"Data hash {train_data_hash} not in approved versions. "
            f"Approved: {expected['approved_versions']}"
        )
 
    print(f"✓ Data lineage verified: {train_data_hash}")
 
# In your training script
verify_data_lineage()

Then maintain a registry of approved data versions:

yaml
# expected_data_versions.yaml
approved_versions:
  - a1b2c3d4e5f6 # Initial release
  - f6e5d4c3b2a1 # Added field validation
  - f9e8d7c6b5a4 # Fixed label errors

When you update the dataset, you explicitly add it to the approved list after validation.

Part 4: Versioning Conventions That Scale

As your team grows, you need naming disciplines. Here's what works in practice.

Dataset Versioning Convention

data/
├── raw/
│   ├── 2024-01-15_customers_v1.csv  (date + version)
│   └── 2024-02-20_customers_v2.csv  (fixed address bugs)
├── processed/
│   ├── train_v2_hash_a1b2c3d4.csv
│   └── test_v2_hash_a1b2c3d4.csv

Always include:

  1. Date: When the data was collected/published
  2. Version number: Incremented when substantive changes occur
  3. Hash: The first 8 characters of the file's MD5 (for uniqueness in logs)

In DVC:

yaml
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py --input-date 2024-02-20
    outs:
      - data/processed/train_v2_hash_a1b2c3d4.csv
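A tiny helper keeps the convention uniform instead of hand-typed. The function name and signature are ours; it combines all three elements (date, version, short hash) from the checklist above:

```python
def dataset_filename(name, version, md5_hash, day, ext="csv"):
    """Build a dataset filename following the date + version + short-hash
    convention, e.g. '2024-02-20_customers_v2_hash_a1b2c3d4.csv'."""
    return f"{day}_{name}_v{version}_hash_{md5_hash[:8]}.{ext}"
```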

Model Versioning: Semantic Versioning

Adopt semantic versioning for released models:

1.0.0   - First production release
1.0.1   - Patch: minor bug fix, no retraining
1.1.0   - Minor: retrained on same data, improved metrics
2.0.0   - Major: architecture change or data distribution shift
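Codifying the bump rules avoids debates at release time. A small sketch that applies the table above:

```python
def bump_model_version(version: str, change: str) -> str:
    """Apply the semantic-versioning rules:
    'major' - architecture or data-distribution change,
    'minor' - retrained with improved metrics,
    'patch' - fix with no retraining."""
    major, minor, patch = (int(p) for p in version.split('.'))
    if change == 'major':
        return f"{major + 1}.0.0"
    if change == 'minor':
        return f"{major}.{minor + 1}.0"
    if change == 'patch':
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change}")
```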

Tag releases in Git:

bash
git tag -a "model-customer-churn-1.1.0" -m "Improved F1 to 0.92"
git push origin "model-customer-churn-1.1.0"

And in MLflow:

python
mlflow.log_param("model_version", "1.1.0")
mlflow.log_param("release_date", "2024-02-20")
mlflow.set_tag("release_type", "minor")  # use tags for release type

Deprecation Policies

As a team practice, establish when models are deprecated:

yaml
# models/deprecation_policy.yaml
deprecation_rules:
  archived_models:
    - if_age_days: 180
      then: delete_from_production
    - if_age_days: 365
      then: move_to_cold_storage
 
  performance_degradation:
    - if_metric_drop_percent: 5
      then: rollback_and_alert
    - if_metric_drop_percent: 10
      then: immediately_rollback
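Rules like these are straightforward to evaluate in a monitoring job. A sketch that picks the action for the largest matched threshold, using the field names from the YAML above:

```python
def degradation_action(drop_percent, rules):
    """Return the 'then' action for a metric drop, or None if no rule
    matches. When several thresholds match, the strictest (largest) wins."""
    matched = [r for r in rules if drop_percent >= r['if_metric_drop_percent']]
    if not matched:
        return None
    return max(matched, key=lambda r: r['if_metric_drop_percent'])['then']
```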

Automated monitoring catches regressions:

python
# monitoring/check_model_drift.py
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
prod_model = client.get_latest_versions("customer-churn-rf", stages=["Production"])[0]
prod_metrics = client.get_run(prod_model.run_id).data.metrics

# evaluate_on_latest_data and alert are your own helpers
model_uri = f"models:/customer-churn-rf/{prod_model.version}"
current_accuracy = evaluate_on_latest_data(model_uri)

if current_accuracy < prod_metrics['test_accuracy'] * 0.95:  # 5% relative drop
    alert("Model accuracy dropped! Rollback recommended.")

Production Considerations: From Dev to Serving

Version control in isolation is useless. The real value is in connecting development to production. Here's what separates amateur teams from professional ones.

Immutable Model Registry Across Environments

You have three environments: Dev, Staging, Production. Models move through them. The cardinal rule: once a model is promoted to an environment, its version is immutable.

This prevents the "I swear I deployed v1.0 but production is running v1.1" nightmare:

python
# In your serving code (production)
import mlflow
 
# Load by stage, never by version number
model = mlflow.sklearn.load_model("models:/customer-churn-rf/Production")
 
# This guarantees you get exactly what was promoted to Production
# If someone tries to switch models, they must go through the MLflow registry

For safety, use environment variables to specify which environment the code is running in:

bash
# In your Docker container or Lambda environment
export MLFLOW_TRACKING_URI="https://mlflow.mycompany.com"
export MLFLOW_MODEL_REGISTRY_STAGE="Production"  # Never hardcode

And validate in code:

python
import os
from mlflow.tracking import MlflowClient
 
client = MlflowClient()
stage = os.getenv("MLFLOW_MODEL_REGISTRY_STAGE", "None")
 
if stage not in ["Staging", "Production"]:
    raise ValueError(f"Invalid stage: {stage}. Must be Staging or Production.")
 
model = mlflow.sklearn.load_model(f"models:/customer-churn-rf/{stage}")

Rollback Safety: Always Keep N-1 in Production

Never delete the previous production model version. If the new version fails, you need to rollback instantly. Keeping the old model lets you do that without rebuilding from source:

python
# When promoting a new model
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get the current production version (if any)
prod_versions = client.get_latest_versions("customer-churn-rf", stages=["Production"])
current_prod_version = int(prod_versions[0].version) if prod_versions else None

# Promote the new version to Production
new_version = 5
client.transition_model_version_stage(
    "customer-churn-rf",
    new_version,
    "Production"
)

# Keep the previous version in Staging for quick rollback
if current_prod_version is not None:
    client.transition_model_version_stage(
        "customer-churn-rf",
        current_prod_version,
        "Staging"
    )

    # Optionally prune versions older than N-1 that are already archived
    # (search_model_versions can't filter by stage, so filter client-side)
    for v in client.search_model_versions("name='customer-churn-rf'"):
        if v.current_stage == "Archived" and int(v.version) < current_prod_version - 1:
            client.delete_model_version("customer-churn-rf", v.version)

Then your rollback procedure is:

python
# If production model degrades, rollback immediately
client.transition_model_version_stage(
    "customer-churn-rf",
    5,  # The failing version
    "Archived"
)
client.transition_model_version_stage(
    "customer-churn-rf",
    4,  # The previous version
    "Production"
)
# Serving code automatically picks up version 4 next request

Drift Detection from Version Control

Your model was trained on a specific dataset (hash stored in DVC). Production data is now different (data drift). Without version control, you wouldn't notice. With it, you can detect and alert:

python
import yaml
from datetime import datetime, timezone

def detect_data_drift():
    """Compare current production data to training data"""

    # What data was the model trained on? Read the committed dvc.lock
    with open('dvc.lock') as f:
        dvc_lock = yaml.safe_load(f)
    training_data_hash = dvc_lock['prepare']['deps'][0]['md5']

    # What data is production currently using?
    # compute_hash and get_current_data are your own helpers
    current_data_hash = compute_hash(get_current_data())

    if training_data_hash != current_data_hash:
        alert(f"Data drift detected! Model trained on {training_data_hash}, "
              f"but production is using {current_data_hash}")

        # Log to monitoring system
        log_drift_event({
            "model_version": "1.1.0",
            "training_data": training_data_hash,
            "current_data": current_data_hash,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": "alert_on_call_team"
        })

This catches problems proactively, before your model predictions become garbage.

Putting It All Together: An End-to-End Example

Here's a realistic workflow from a data scientist's perspective:

bash
# 1. Clone repo—get code only
git clone https://github.com/myteam/customer-churn.git
cd customer-churn
 
# 2. Fetch the versioned data and models
dvc pull
 
# 3. Try a new hyperparameter
vim params.yaml  # change learning_rate to 0.01
 
# 4. Run the pipeline
dvc repro
 
# 5. View metrics
dvc metrics show
 
# 6. Log to MLflow
python src/train_with_mlflow.py
 
# 7. Check the experiment
mlflow ui  # view at http://localhost:5000
 
# 8. If metrics improved, commit everything
git add params.yaml dvc.lock src/train_with_mlflow.py
git commit -m "Tune learning_rate to 0.01: accuracy 0.94 → 0.956"
dvc push
 
# 9. Someone (a lead) reviews the run in MLflow
# They see the metrics, the Git commit, the data hash, everything.
 
# 10. They promote the model to Staging, via the MLflow UI or the Python
#     API (there is no `mlflow models promote` CLI command); version 2
#     here stands in for whichever version the run registered
python -c "from mlflow.tracking import MlflowClient; \
MlflowClient().transition_model_version_stage('customer-churn-rf', 2, 'Staging')"

# 11. Run validation tests (automated or manual)
python tests/validate_model.py --stage staging

# 12. Promote to Production the same way
python -c "from mlflow.tracking import MlflowClient; \
MlflowClient().transition_model_version_stage('customer-churn-rf', 2, 'Production')"
 
# 13. Your inference service loads it
# model = mlflow.sklearn.load_model("models:/customer-churn-rf/Production")

No guessing. No manual version tracking. No lost experiments. Everyone knows exactly what's running in production.

The Adoption Curve: From Chaos to Maturity

Most teams don't adopt Git + DVC + MLflow all at once. They evolve through stages, and understanding where you are in that evolution helps you know what to focus on next.

Stage 1: Git Only (The Starting Point)

You're using Git for code, but not for data or models. Datasets are stored on shared network drives or in S3 with no version control. Models are picked based on timestamps or email conversations. This is the stage where most small teams start. It works for a while - maybe for the first 5-10 experiments - but quickly becomes chaos. No one remembers which dataset was used for which model. Someone retrains the model and gets different results; it's unclear why. Multiple people are working on the same datasets and stomping on each other's changes. The problems are real but not acute enough to force change yet.

The fix is straightforward: start using DVC immediately. Pick your largest dataset first and put it under DVC. You'll immediately see the value. Cloning the repo becomes fast again (because the 500MB model file isn't in Git). Everyone has access to the canonical dataset version. When someone asks "which dataset trained model X," you can answer definitively. This alone is worth the effort of learning DVC.

Stage 2: Git + DVC (Better, But Not Complete)

You've got code and data versioned, and you're tracking models in DVC. But you're not tracking experiments systematically. You train a model, get an AUC of 0.94, and save it. Three weeks later, someone trains a similar model and gets 0.96. Which one is better? Can you reproduce either one? You probably have notebooks lying around with the training code, but it's unclear which notebook produced which model. The metadata is scattered and incomplete.

The fix is MLflow. Start logging experiments to MLflow immediately, even if you're just logging a few basic metrics. Write a simple script that loads your MLflow runs and displays a table of experiments, metrics, and which Git commit produced each model. Suddenly, your experiment history becomes queryable. You can ask "which experiments on the Q1 2025 dataset achieved AUC > 0.95?" and get an answer immediately. You can compare two models side by side and see exactly what hyperparameters were different.

Stage 3: Git + DVC + MLflow (Mature)

You've got everything integrated. Every experiment is logged. Every model artifact is versioned. Data versions are tracked. You can answer any question about any model: which data, which code, which hyperparameters, which metrics. Your Git history, DVC history, and MLflow history all align. This is the state you want to reach.

But even here, most teams have blind spots. Do you know which experiments are actually being used in production? Are you monitoring for data drift between your training data and production data? Are you retraining regularly and tracking those retraining experiments? Have you set up automatic rollbacks if a new model degrades in production? These are the questions that take you from competent to mature.

The Hidden Costs of ML Version Control

Implementing proper version control for ML sounds straightforward until you actually do it and discover all the hidden operational costs. We talk about the benefits, but let's be honest about the costs too.

Disk Space: Every version of every dataset you ever trained on lives in your DVC cache and on your remote storage. If you retrain weekly and each model is 500MB, you're accumulating 500MB per week. After a year, that's 26GB just for model artifacts. With datasets, it multiplies. If your training dataset is 50GB and you're creating a new version monthly as you add data, you're using 600GB per year just on that one dataset. Remote storage costs accumulate. You need policies for when to delete old versions. But you can't delete them too aggressively because you might need to reproduce an old experiment. The result is an ongoing operational burden of managing storage, deciding what to keep and what to delete, and dealing with the costs.
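The growth arithmetic is worth automating so the bill is visible before it arrives. This back-of-envelope helper reproduces the estimates in the paragraph above:

```python
def annual_storage_gb(artifact_gb: float, versions_per_year: int) -> float:
    """Storage accumulated in one year if every version is kept."""
    return artifact_gb * versions_per_year

# 0.5 GB model retrained weekly; 50 GB dataset reversioned monthly
model_growth = annual_storage_gb(0.5, 52)    # 26 GB/year
dataset_growth = annual_storage_gb(50, 12)   # 600 GB/year
```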

Operational Complexity: Every team member needs to understand Git, DVC, and MLflow. That's three separate tools with three separate mental models. You need documentation, training, and ongoing support. When someone gets DVC into a weird state - out of sync with remote storage, local cache corrupted - you need someone on your team who can diagnose and fix it. This adds friction to onboarding new team members. It also creates knowledge bottlenecks where only a few people understand how to untangle things when they go wrong.

Integration Complexity: Getting Git, DVC, and MLflow to play nicely together requires careful orchestration. If you're using continuous training - automatically retraining models on a schedule - you need to ensure that the CI/CD system properly coordinates Git commits, DVC pushes, and MLflow logging. It's easy to end up in states where a model is registered in MLflow but the corresponding DVC files aren't committed to Git, or vice versa. These desynchronizations are hard to detect and fix. Many teams end up writing additional monitoring and validation code to ensure their version control systems stay in sync.

Reproducibility Overhead: True reproducibility in ML is hard. It's not enough to have the code and data versioned. You need to lock exact dependency versions, random seeds, the exact hardware configuration, and sometimes even the CUDA version and driver version. If you're serious about reproducibility, you end up checking in requirements.txt with pinned versions, you might pin your Python version, and you definitely need your training script to set random seeds explicitly. This adds overhead but is necessary for true reproducibility. Many teams find this too burdensome and settle for "mostly reproducible" - close enough that the models behave similarly but not byte-for-byte identical. This is often fine, but it's a compromise worth being explicit about.
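The seeding part, at least, can be centralized in one helper. A minimal sketch, assuming a Python training script where numpy and torch may or may not be installed:

```python
import os
import random

def set_global_seeds(seed: int = 42) -> None:
    """Sketch of explicit seeding for reproducibility. numpy and torch are
    seeded only if installed. Note: PYTHONHASHSEED only affects hashing if
    set before the interpreter starts, so exporting it in the training
    environment (not here) is the reliable route."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_global_seeds(7)
a = [random.random() for _ in range(3)]
set_global_seeds(7)
b = [random.random() for _ in range(3)]
print(a == b)  # -> True
```

Even with seeds fixed, GPU nondeterminism and library version drift can still break byte-for-byte identity - which is exactly why many teams settle for "mostly reproducible."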

Cost of Mistakes: One advantage of sloppy versioning is that mistakes are cheap. If you mistakenly commit a bad dataset version or train a model on buggy code, only you are affected. With proper versioning, mistakes propagate faster. If you register a bad model to the registry and it gets promoted to production, multiple teams are affected. If you commit a bad dataset version and it gets used in multiple retraining pipelines, the damage is multiplicative. The flip side is that with proper versioning, you can fix problems more quickly - because you know exactly what's wrong and can easily revert to a previous good version. But the window of damage is usually larger. This is a real tradeoff worth acknowledging.

Key Takeaways

Git + DVC + MLflow is the gold standard for ML version control. Git handles code. DVC handles data lineage and reproducible pipelines. MLflow handles experiment tracking and model registry.

The three are designed to integrate: Git commits link to DVC's dvc.lock file, which records data and model hashes. MLflow logs the Git commit hash automatically and links it to metrics, hyperparameters, and registered models. Together, they create a complete audit trail.

Without this, you're flying blind. You'll spend more time debugging "which dataset trained this?" than actually improving your models. With this system in place, you can confidently say: "That model used dataset hash X, code commit Y, hyperparameters Z, and achieved metric M. Here's how to reproduce it."

That's professional ML. That's what shipping real models looks like.

Adopt it gradually. Start with Git if you haven't already. Add DVC when you have large data or models. Add MLflow when you're running enough experiments that you can't track them manually. Build the integration over time. Don't expect to get it perfect on day one. Version control is a journey, not a destination. But with every step along the way, you'll gain confidence in your ability to build reliable ML systems.

Organizational Challenges in ML Version Control: Real-World Complexity

Operating Git plus DVC plus MLflow in production teaches hard lessons that academic literature rarely covers. The straightforward architecture becomes complicated the moment you have multiple teams, multiple models, and interdependencies. Consider the common scenario: your recommendation team maintains a feature pipeline that generates user embeddings. Five other teams depend on these embeddings - they train classification models, ranking models, and churn prediction models on top of them. The recommendation team improves the embedding computation. Now what? Do they create a new feature version? Do they break backward compatibility? The question sounds simple but cascades into operational complexity. If they change the embedding format without versioning, all downstream models break simultaneously. If they maintain multiple versions, storage and compute requirements balloon. If they migrate gradually, they need a transition period where both versions coexist, which is messy.

The version control problem in this scenario isn't technical - it's organizational. You need clear policies about which versions are supported, which models are allowed to depend on which versions, and what happens when you deprecate a version. Some teams operate "always latest" policies where all models automatically use the newest feature version. This keeps things simple but breaks whenever features change unexpectedly. Other teams pin versions, specifying "this model uses feature embeddings v2.3.1." This prevents surprise breaks but means you need a deprecation timeline. Embedding v2.3.1 works fine for three years, then you want to retire it. Now you have to migrate seven teams off that version - a coordination burden that grows with the number of dependencies.

Smart teams solve this by treating feature pipelines like published APIs. Version them explicitly. Document breaking changes. Maintain multiple major versions in parallel. Create a roadmap for deprecation. But this adds significant operational overhead. You're not just versioning code anymore; you're running a platform for other teams to depend on. The moment you cross that threshold - where versioning decisions made by one team affect downstream teams - the entire scaling story changes. Version control is no longer about "reproducibility for my team." It's about "managing distributed systems where changes propagate across organization boundaries." This is where many teams get stuck, because the technical solutions (Git + DVC + MLflow) are solid, but they don't solve the organizational coordination problem.

Scaling Version Control Beyond Single Teams

As organizations grow and multiple teams work on interdependent models, version control becomes genuinely complex. A recommendation team's feature extraction pipeline might be a dependency for five other teams' models. If the recommendation team updates the pipeline (adding a feature, fixing a bug), how do downstream teams know? Do they need to retrain their models? Do their models break because they depend on a feature that's no longer computed?

This is the concept of feature versioning. Instead of versioning just the feature code, you version the feature outputs themselves. Using tools like Tecton or Feast, you maintain a feature store where features are versioned alongside their lineage. When you update the recommendation pipeline, the feature store creates a new version of those features. Downstream teams can choose to upgrade to the new features (and retrain their models) or stay on the old version. The version control system tracks these dependencies.

Implementing feature versioning requires buy-in from multiple teams and careful coordination. It's not something a single team can do in isolation. But once established, it enables organizations to evolve their feature engineering without breaking downstream models. It also enables debugging: if a model degraded after a specific date, you can check if feature versions changed on that date and investigate whether that's the cause.
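The core bookkeeping can be sketched in a few lines. This toy registry is hypothetical and far simpler than Tecton or Feast, but it shows the three moving parts: producers publish versions with lineage, consumers pin them, and deprecation surfaces exactly who must migrate:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureVersion:
    """Hypothetical record of one published feature version and its lineage."""
    name: str
    version: str
    code_commit: str      # Git commit of the pipeline that produced it
    data_hash: str        # DVC hash of the output artifact
    deprecated: bool = False

@dataclass
class FeatureRegistry:
    """Minimal sketch of a feature store's versioning layer."""
    versions: dict = field(default_factory=dict)   # (name, version) -> FeatureVersion
    consumers: dict = field(default_factory=dict)  # (name, version) -> set of models

    def publish(self, fv: FeatureVersion):
        self.versions[(fv.name, fv.version)] = fv

    def pin(self, model: str, name: str, version: str):
        self.consumers.setdefault((name, version), set()).add(model)

    def deprecate(self, name: str, version: str):
        self.versions[(name, version)].deprecated = True
        # Return the downstream models that must migrate off this version.
        return sorted(self.consumers.get((name, version), ()))

reg = FeatureRegistry()
reg.publish(FeatureVersion("user_embeddings", "2.3.1", "a1b2c3d", "5f6e7d8"))
reg.publish(FeatureVersion("user_embeddings", "3.0.0", "e9f0a1b", "9c8b7a6"))
reg.pin("churn-model", "user_embeddings", "2.3.1")
reg.pin("ranking-model", "user_embeddings", "2.3.1")
print(reg.deprecate("user_embeddings", "2.3.1"))
# -> ['churn-model', 'ranking-model']
```

The lineage fields are what enable the debugging workflow described above: given a degradation date, you can look up which feature versions changed and which code commit and data hash produced them.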

For very large organizations (especially those training hundreds of models), feature versioning becomes as important as model versioning. The infrastructure investment pays off through reduced debugging time, faster iteration, and reduced risk of cascading failures.

Long-term Maintenance and Cleanup

Version control systems accumulate cruft over time. You have model versions that are no longer used. You have experiments that are archived. You have datasets that are no longer needed. Without active cleanup, your version control systems become bloated and slow.

Establish policies for when to delete old versions. A model that's been archived for a year is probably safe to delete. A dataset that's no longer used by any active models can probably be removed. A failed experiment from two years ago can be archived to cold storage. But implement these policies carefully - you might need to resurrect an old model version for debugging or regulatory compliance. Document what you delete and why. Some teams use tiered storage: recent versions in fast storage (S3 standard), older versions in cold storage (S3 Glacier). This balances cost and accessibility.
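A tiering policy like that reduces to a simple age-based rule. A sketch with illustrative thresholds - the day counts and tier names are assumptions, not recommendations:

```python
from datetime import date, timedelta

def storage_tier(last_used: date, today: date,
                 hot_days: int = 90, warm_days: int = 365) -> str:
    """Sketch of a tiered-storage policy: recently used artifacts stay in
    standard storage, older ones move to cold storage, and only well-aged
    artifacts become deletion candidates. Thresholds are illustrative."""
    age = (today - last_used).days
    if age <= hot_days:
        return "standard"         # e.g. S3 Standard
    if age <= warm_days:
        return "cold"             # e.g. S3 Glacier
    return "delete-candidate"     # review for compliance before deleting

today = date(2025, 11, 24)
print(storage_tier(today - timedelta(days=30), today))    # -> standard
print(storage_tier(today - timedelta(days=200), today))   # -> cold
print(storage_tier(today - timedelta(days=500), today))   # -> delete-candidate
```

The "delete-candidate" tier is deliberately not an automatic delete: regulatory retention requirements and the occasional need to resurrect an old model for debugging mean a human should sign off, and the decision should be documented.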


