Data Versioning with DVC and LakeFS
You're building a machine learning pipeline. Your model trains beautifully on Tuesday's dataset. By Thursday, someone updates the data - and suddenly your model's predictions drift by 15%. You scramble to figure out what changed, when it changed, and how to reproduce Tuesday's results.
This is the data versioning problem. It's unsexy, it doesn't appear in research papers, but it's the friction point that separates prototype teams from production teams.
In this article, we'll explore two powerful approaches to solving it: DVC (Data Version Control), which brings Git-like workflow to large datasets, and LakeFS, which adds branching and isolation to your object storage. We'll see how they work, when to use each, and how to build them into your ML infrastructure.
Table of Contents
- The Data Versioning Problem
- DVC: Git-Style Data Versioning
- How DVC Works
- Setting Up DVC
- DVC Pipelines and Lineage
- Branching and Tagging Datasets
- LakeFS: Copy-On-Write Branching for Object Storage
- How LakeFS Works
- Setting Up LakeFS
- Copy-On-Write Magic
- Merging and Conflict Resolution
- Lineage Tracking: From Data to Model
- DVC Lineage
- LakeFS Lineage
- Combining DVC and LakeFS Lineage
- CI/CD Integration: Automating Retraining
- DVC + CI/CD
- LakeFS + CI/CD
- Advanced CI/CD Patterns
- Comparison Matrix: DVC vs LakeFS vs Alternatives
- Storage Overhead Deep Dive
- Real-World Cost Analysis
- Practical Example: End-to-End Data Versioning
- Setup with DVC
- Setup with LakeFS
- Choosing Between DVC and LakeFS
- Integration Patterns
- Common Integration Mistakes
- The Cost of Not Versioning: A Cautionary Tale
- Wrapping Up
The Data Versioning Problem
Let's get specific about what we're dealing with here.
Traditional version control systems like Git excel at text. They're terrible at 50GB Parquet files. If you try to commit a 100GB dataset to Git, you'll:
- Bloat your repository to several hundred gigabytes
- Make cloning and pulling glacially slow
- Store duplicate copies of nearly-identical dataset versions
- Destroy your CI/CD pipeline
But your ML pipeline depends on knowing exactly which data your model trained on. Reproduce a result from six months ago? You need that exact dataset version. Debug a silent data quality issue? You need to diff versions.
The cost of getting this wrong is staggering. I've seen teams spend weeks debugging model degradation only to discover it wasn't the model - it was a silent data change. A single field that changed format. An upstream data source that got updated. A batch of records that got deduplicated. Any of these can shift model behavior, and without data versioning, you're flying blind.
This is where DVC and LakeFS enter the picture. They solve the same problem from different angles:
- DVC treats data like code. You store lightweight pointers in Git and the actual data in object storage (S3, GCS, etc.). Version your datasets with tags and branches, just like code.
- LakeFS gives your object storage Git-like semantics directly. Create branches, merge changes, and isolate experiments on S3 without duplicating data through copy-on-write.
Both approaches let you version data, but they make different tradeoffs around complexity, storage overhead, and integration depth. Understanding these tradeoffs is crucial because picking the wrong tool early means painful migration later.
DVC: Git-Style Data Versioning
How DVC Works
DVC operates on a simple principle: store pointers in Git, data in object storage.
Here's the workflow:
- You add a large file (`dataset.csv`) to your project
- DVC computes its MD5 hash and creates a `.dvc` file containing that hash
- DVC uploads the actual data to S3 (or GCS, Azure, local storage, etc.)
- You commit the `.dvc` file to Git
- The `.dvc` file acts as a pointer - pull the repo, run `dvc pull`, and DVC fetches the matching data
This approach gives you:
- Git integration: `.dvc` files are text, so they diff cleanly
- Branching and tagging: create dataset versions as easily as code versions
- Pipelines: link data changes to model retraining automatically
The genius of this design is its simplicity. You're not learning a new versioning system; you're leveraging Git's battle-tested workflow for data. Your infrastructure teams already understand Git branching, merging, and conflict resolution. DVC just extends those concepts to large files.
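To make the pointer idea concrete, here's a toy sketch of hashing a data file and writing a small pointer file that Git can track. This is an illustration of the concept, not DVC's actual on-disk format:

```python
import hashlib
import json
from pathlib import Path

def make_pointer(data_path: str) -> dict:
    """Build a DVC-style pointer: a hash plus size, tiny enough for Git."""
    data = Path(data_path).read_bytes()
    return {
        "md5": hashlib.md5(data).hexdigest(),  # DVC hashes with MD5 by default
        "size": len(data),
        "path": Path(data_path).name,
    }

# The pointer stays a few hundred bytes no matter how big the data file is,
# so Git stays fast while object storage holds the actual bytes.
Path("training_set.parquet").write_bytes(b"x" * 1_000_000)
pointer = make_pointer("training_set.parquet")
Path("training_set.parquet.dvc").write_text(json.dumps(pointer, indent=2))
```

Two files with identical content produce identical hashes, which is how a content-addressed cache avoids storing the same bytes twice.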
Setting Up DVC
Let's walk through a practical example. You're tracking a training dataset for a recommendation system.
# Initialize DVC in your repo
dvc init
# Configure remote storage (S3 in this case)
dvc remote add -d myremote s3://my-ml-bucket/dvc-storage
# Add your dataset
dvc add data/training_set.parquet
# This creates data/training_set.parquet.dvc
cat data/training_set.parquet.dvc
# Outputs:
# outs:
# - md5: a1b2c3d4e5f67890a1b2c3d4e5f67890
#   size: 52428800
#   path: training_set.parquet
# Push to remote storage
dvc push
# Commit the .dvc file (not the data)
git add data/training_set.parquet.dvc
git commit -m "Add initial training dataset v1"

Now your teammate pulls the repo:
# Cloning gets the .dvc files, not the data
git clone <repo>
# Fetch the actual data
dvc pull
# training_set.parquet is now available locally

This is beautifully simple, but it hides sophisticated machinery. DVC is managing cache consistency, computing hash hierarchies, and coordinating between multiple backends. The simplicity is the point - you shouldn't have to think about how it works.
DVC Pipelines and Lineage
The real power emerges when you connect data versions to model training. Use dvc.yaml to define your pipeline:
# dvc.yaml - Define your ML pipeline
stages:
  prepare:
    cmd: python scripts/prepare.py
    deps:
      - data/raw_data.csv
    outs:
      - data/prepared.parquet
  train:
    cmd: python scripts/train.py
    deps:
      - data/prepared.parquet
      - scripts/train.py
    params:
      - train.epochs
      - train.learning_rate
    metrics:
      - metrics/accuracy.json:
          cache: false
    plots:
      - metrics/confusion_matrix.csv:
          cache: false
    outs:
      - models/model.pkl
  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - data/prepared.parquet
      - models/model.pkl
    metrics:
      - metrics/eval_results.json:
          cache: false

Now you can run your entire pipeline:
# Run all stages that need rerunning
dvc repro
# Outputs:
# Stage 'prepare' didn't change, skipping
# Running stage 'train'...
# Running stage 'evaluate'...
# Pipeline complete!

When you change the input data, DVC knows which stages to rerun. Better yet, you get a deterministic DAG of your ML workflow. This is invaluable for debugging and auditing. If someone asks "what model did we use to predict that value?", you can trace backward through the DAG and answer definitively.
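The skip-or-rerun decision boils down to comparing current dependency hashes against what the lock file recorded. A simplified model of the idea (not DVC's actual implementation):

```python
import hashlib
from pathlib import Path

def file_hash(path: str) -> str:
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def stage_is_stale(deps: list[str], recorded: dict[str, str]) -> bool:
    """A stage reruns iff any dependency's hash differs from the lock record."""
    return any(file_hash(d) != recorded.get(d) for d in deps)

Path("prepared.parquet").write_bytes(b"v1")
lock = {"prepared.parquet": file_hash("prepared.parquet")}
assert not stage_is_stale(["prepared.parquet"], lock)  # unchanged: skip stage

Path("prepared.parquet").write_bytes(b"v2")            # data changed upstream
assert stage_is_stale(["prepared.parquet"], lock)      # now 'train' must rerun
```

Because code files are dependencies too, editing `train.py` invalidates the stage just like changing the data would.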
Branching and Tagging Datasets
Version your datasets like code:
# Use Git tags to capture entire dataset+model pairs
git tag -a v1.0-dataset -m "Training set version 1.0"
git push origin v1.0-dataset
# Checkout a specific version
git checkout v1.0-dataset
dvc checkout  # Restores data from that point in time

This is powerful for comparing models trained on different dataset versions - or reproducing results from months ago. You're not just versioning code; you're versioning the complete computational provenance.
LakeFS: Copy-On-Write Branching for Object Storage
How LakeFS Works
LakeFS takes a different approach. Instead of working alongside object storage, it becomes an abstraction layer on top of S3 (or S3-compatible services). It gives your object storage Git-like semantics without duplicating data.
Here's the mental model:
- You upload data to LakeFS (which stores it on S3)
- You create branches - isolated versions of your data lake
- Changes on a branch use copy-on-write (only changed objects consume new storage)
- You can merge branches, revert commits, and audit the entire history
LakeFS uses a metadata layer to track versions and branches, keeping the actual object data efficient. The key insight: metadata is cheap, but data is expensive. So LakeFS stores only metadata for unchanged objects, and duplicates only what changes.
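A toy model makes the mechanism clear. Assume a content-addressed object store; a branch is then nothing more than a small mapping from paths to object hashes (this illustrates the idea, not LakeFS internals):

```python
import hashlib

# Toy model: a branch is just a mapping of path -> object hash.
# Branching copies the (small) mapping, never the (large) objects.
objects = {}  # content-addressed store: hash -> bytes (lives on S3)

def put(branch: dict, path: str, data: bytes) -> None:
    h = hashlib.sha256(data).hexdigest()
    objects[h] = data   # copy-on-write: a new object appears only on change
    branch[path] = h    # the metadata update is cheap

main = {}
put(main, "raw/day1.parquet", b"original records")

experiment = dict(main)  # "branching": shares every object with main
put(experiment, "raw/day1.parquet", b"cleaned records")

assert main["raw/day1.parquet"] != experiment["raw/day1.parquet"]
assert len(objects) == 2  # one extra object total, not a full duplicate lake
```

Unchanged paths in `experiment` still point at the exact same objects as `main`, which is why creating the branch costs almost nothing.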
Setting Up LakeFS
LakeFS requires a bit more infrastructure than DVC, but the payoff is significant for teams managing large data lakes.
# Install LakeFS (Docker quickstart)
docker run -it \
-p 8000:8000 \
-v /tmp/lakefs:/home/lakefs/data \
treeverse/lakefs:latest
# Access the UI at http://localhost:8000
# Create a repository connected to your S3 bucket

Once configured, you interact with LakeFS via its Python SDK or REST API:
import lakefs_sdk

# Connect to LakeFS
configuration = lakefs_sdk.Configuration(
    host="http://localhost:8000/api/v1",
    username="ACCESS_KEY_ID",
    password="SECRET_ACCESS_KEY",
)
api_client = lakefs_sdk.ApiClient(configuration)

# Create a branch for an experiment
branches_api = lakefs_sdk.BranchesApi(api_client)
branches_api.create_branch(
    repository="my-datalake",
    branch_creation=lakefs_sdk.BranchCreation(
        name="experiment-v2",
        source="main",
    ),
)

# Work on the branch
# Upload data, transform it, whatever you need

Copy-On-Write Magic
Here's where LakeFS differs fundamentally from DVC. When you create a branch, LakeFS doesn't duplicate data. Instead:
- Metadata about the branch structure is stored
- Objects are shared between branches
- Changes trigger copy-on-write - only modified objects get new storage
To understand why this matters, consider what happens with traditional branching. If you have a 500GB dataset and branch it five times for five concurrent experiments, you'd consume 2.5TB of storage - one copy per branch. With copy-on-write, you get:
- Initial state: All five branches reference the same objects, plus metadata (~50GB)
- After modifications: Only changed objects are duplicated. If each experiment modifies 10% of the data, you add ~250GB total
- Final storage: ~750GB instead of 2.5TB
This efficiency multiplier grows as you add more branches. For organizations running dozens of concurrent ML experiments, the cost savings are substantial. LakeFS tracks metadata relationships at the object level, enabling this elegant sharing without sacrificing isolation or data integrity.
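The arithmetic above can be parameterized. A quick sketch, where the 10% metadata fraction is an assumption backed out from the scenario's ~50GB figure (the result lands slightly above the ~750GB quoted above because it counts that metadata):

```python
def cow_storage_gb(base_gb: float, branches: int, change_frac: float,
                   metadata_frac: float = 0.10) -> float:
    """Total storage under copy-on-write branching: one shared base copy,
    branch metadata, and duplicated objects only for the changed fraction."""
    metadata = base_gb * metadata_frac
    changed = base_gb * change_frac * branches
    return base_gb + metadata + changed

naive = 500 * 5  # full copy per branch: 2500 GB for five branches
cow = cow_storage_gb(500, branches=5, change_frac=0.10)
print(naive, cow)  # copy-on-write stays near a third of the naive cost
```

Push `branches` higher and the gap widens, which is the "efficiency multiplier" described above.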
Let's visualize this:
graph TD
A["S3 Bucket"]
B["LakeFS<br/>Metadata Layer"]
C["main branch<br/>pointers"]
D["exp-v2 branch<br/>pointers"]
E["Shared<br/>Objects"]
A -->|stores| E
B -->|tracks| C
B -->|tracks| D
C -->|references| E
D -->|references<br/>mostly same| E
style B fill:#4CAF50
style E fill:#2196F3

If your dataset is 100GB and you branch to create an experiment, that branch costs nearly zero additional storage initially. Only the metadata overhead and any new objects you upload consume extra space.
Merging and Conflict Resolution
LakeFS supports merging branches with conflict detection:
# Merge experimental branch back to main
refs_api = lakefs_sdk.RefsApi(api_client)
refs_api.merge_into_branch(
    repository="my-datalake",
    source_ref="experiment-v2",
    destination_branch="main",
    merge=lakefs_sdk.Merge(
        message="Merge cleaned dataset from experiment-v2"
    ),
)
# If there are conflicts, they'll be reported
# You can resolve them before committing

This is crucial for multi-team data pipelines where different teams update the data lake simultaneously. Without merge semantics, concurrent writers silently overwrite each other's changes; with them, you can coordinate safely.
Lineage Tracking: From Data to Model
Both DVC and LakeFS enable lineage tracking, but they approach it differently. Lineage is critical in production ML systems. When a model makes a bad prediction, you need to trace back: What exact data version was it trained on? What preprocessing steps were applied? Did any data quality issues slip through? Without comprehensive lineage, debugging is guesswork. With it, you have a full chain of custody.
DVC Lineage
DVC tracks lineage through dvc.yaml pipelines. Each stage declares its dependencies and outputs:
stages:
  create_features:
    cmd: python scripts/features.py
    deps:
      - data/raw.csv
      - scripts/features.py
    outs:
      - data/features.parquet
  train_model:
    cmd: python scripts/train.py
    deps:
      - data/features.parquet
      - scripts/train.py
    outs:
      - models/model.pkl
      - metrics/eval.json

DVC builds a DAG from this. Query it:
# See the dependency graph
dvc dag
# Output (simplified):
#
#   data/raw.csv
#        |
#   create_features
#        |
#   data/features.parquet
#        |
#   train_model
#        |
#   models/model.pkl
#
# Compare metrics between the tagged dataset version and now
dvc metrics diff v1.0 HEAD

This is invaluable for auditing. You can prove exactly which dataset version trained which model.
LakeFS Lineage
LakeFS tracks lineage through commit metadata. Each commit to a branch includes:
- The objects changed
- A commit message
- Timestamp
- Author
- Parent commit
# Get commit history for a branch
refs_api = lakefs_sdk.RefsApi(api_client)
commits = refs_api.log_commits(
    repository="my-datalake",
    ref="main",
    amount=10,
)
for commit in commits.results:
    print(f"{commit.id}: {commit.message}")
    print(f"  Author: {commit.committer}")
    print(f"  Time: {commit.creation_date}")

For deeper lineage, many teams integrate LakeFS with tools like OpenLineage to track end-to-end data and model dependencies. LakeFS handles the data lake's history, while OpenLineage captures the broader pipeline lineage.
Combining DVC and LakeFS Lineage
Some mature teams use both tools together for complete lineage coverage:
- LakeFS tracks raw data changes: which dataset version, when it changed, who approved it
- DVC tracks feature engineering and model training: which raw data fed into features, which features went into the model
- OpenLineage (optional) stitches them together: entire lineage from raw data to prediction
This layered approach provides maximum observability but requires coordination. The trade-off: more infrastructure, but bulletproof audit trails.
CI/CD Integration: Automating Retraining
Both tools integrate into CI/CD to trigger model retraining when data changes. This is where data versioning moves from a nice-to-have to a must-have in production systems. You want changes to data flowing through your pipeline automatically - validated, tested, and retrained - without manual intervention.
In practice, this means:
- Detecting data changes automatically (new files, modifications, schema changes)
- Running validation before retraining (quality checks, schema compliance)
- Triggering model training with the correct data version
- Comparing metrics against the previous model to ensure no regression
- Deploying only if metrics pass thresholds
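Step 4, the regression gate, is straightforward to implement. A minimal sketch, assuming your evaluation step writes an `accuracy` field to a metrics JSON (the field name and threshold are illustrative):

```python
def passes_gate(new_metrics: dict, old_metrics: dict,
                max_regression: float = 0.02) -> bool:
    """Allow deployment only if accuracy hasn't dropped past the threshold."""
    return new_metrics["accuracy"] >= old_metrics["accuracy"] - max_regression

previous = {"accuracy": 0.92}
assert passes_gate({"accuracy": 0.91}, previous)      # within tolerance: deploy
assert not passes_gate({"accuracy": 0.88}, previous)  # regressed too far: block
```

In CI, the gate's boolean becomes the job's exit status, so a failing comparison blocks the merge or the deploy step.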
DVC + CI/CD
DVC's CI/CD integration is straightforward. On a pull request that modifies data:
# .github/workflows/train.yml
name: Train on Data Changes
on:
  pull_request:
    paths:
      - "data/**"
      - "dvc.yaml"
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - name: Pull data
        run: dvc pull
      - name: Run pipeline (dry run)
        run: dvc repro --no-commit
      - name: Log metrics
        run: |
          echo "## Model Metrics" >> $GITHUB_STEP_SUMMARY
          cat metrics/eval.json >> $GITHUB_STEP_SUMMARY

The --no-commit flag runs your entire pipeline without saving the results to the DVC cache - perfect for validating that a data change doesn't break your training.
LakeFS + CI/CD
With LakeFS, you trigger retraining when branches are merged:
# Lambda function or webhook handler
def on_merge_to_main(event):
    """Triggered when a branch merges to main"""
    merge_details = event['details']
    source_branch = merge_details['source']
    destination = merge_details['destination']
    if destination == 'main':
        # Trigger retraining job
        trigger_training_job(
            dataset_branch='main',
            dataset_version=merge_details['commit_id'],
        )

This pattern works well for teams managing multiple concurrent data updates. Only merge to main when you're confident the data is clean, and the merge itself triggers validation and retraining.
Advanced CI/CD Patterns
Both tools support sophisticated automation patterns:
Feature branch validation (DVC):
# Developer opens PR with new dataset
# CI automatically:
# 1. Checks out the branch
# 2. Runs dvc repro
# 3. Computes metrics
# 4. Posts metrics comparison in PR
# 5. Blocks merge if metrics regress >2%

Scheduled retraining (LakeFS):
# Nightly job checks for new commits on 'main'
# If changes detected:
# 1. Snapshot the current main branch
# 2. Trigger training job with that snapshot
# 3. Compare to production model
# 4. Auto-deploy if performance improves

These patterns eliminate manual data management and ensure your models train on consistent, validated data.
Comparison Matrix: DVC vs LakeFS vs Alternatives
Let's ground this with concrete comparisons. Here's how DVC and LakeFS stack up against other solutions:
graph LR
A["Data Versioning<br/>Solutions"]
A --> B["DVC"]
A --> C["LakeFS"]
A --> D["Delta Lake"]
A --> E["Apache Iceberg"]
B --> B1["✓ Git native<br/>✓ Lightweight<br/>✗ Full copy per version<br/>✗ No data-native branching"]
C --> C1["✓ Real branches<br/>✓ Copy-on-write<br/>✓ Audit trail<br/>✗ Separate infrastructure"]
D --> D1["✓ ACID transactions<br/>✓ Schema evolution<br/>✗ Parquet-specific<br/>✗ Not git-native"]
E --> E1["✓ Time travel<br/>✓ Hidden partitions<br/>✓ Schema on read<br/>✗ High metadata overhead"]
style B fill:#FF6B6B
style C fill:#4ECDC4
style D fill:#45B7D1
style E fill:#FFA07A

Here's the detailed comparison:
| Feature | DVC | LakeFS | Delta Lake | Iceberg |
|---|---|---|---|---|
| Git Integration | Native | No (own REST API/S3 gateway) | No | No |
| Branching | Git branches | First-class | No | No |
| Copy-on-Write | No (full copy) | Yes | No | No |
| Audit Trail | Commit history | Full audit log | Transaction log | Full manifest history |
| Setup Complexity | Low | Medium-High | Medium | Medium-High |
| Storage Overhead | ~0% metadata (full data copy per version) | 5-15% (metadata) | ~5% | ~10-15% |
| Latency | <100ms | 50-200ms | <100ms | 100-300ms |
| Query Engine Support | Limited | Spark, Presto | Spark, Flink, Trino | Spark, Flink, Presto |
| ML/Data Science UX | Excellent | Good | Fair | Fair |
Storage Overhead Deep Dive
Let's quantify storage costs. Assume a 500GB dataset with 10% daily changes over 30 days:
DVC Approach:
- Initial: 500GB (in S3)
- After 30 days: 500GB × 30 versions = 15TB if you keep all versions (worst case: a monolithic file that changes daily; DVC deduplicates at file level, so many small files soften this)
- Or, garbage collect old versions and keep 5: 500GB × 5 = 2.5TB
LakeFS Approach:
- Initial: 500GB (in S3)
- Metadata overhead: ~25GB (roughly 5% of the base dataset, accumulated across 30 commits)
- Changed objects only: 50GB × 30 days = 1.5TB
- Total: 500GB + 25GB + 1.5TB = ~2TB
For this scenario, LakeFS uses ~2TB versus 15TB for full DVC history - about 87% less - and still ~20% less than keeping only 5 versions with DVC.
However, if you use DVC with selective version retention and garbage collection, the picture changes. The real advantage of LakeFS emerges in heavily branched workflows where multiple experiments run simultaneously.
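To reproduce the numbers above (under the stated worst-case assumption that DVC stores each version as a full copy):

```python
def dvc_full_history_tb(base_gb: float, versions: int) -> float:
    # Worst case for a single monolithic file: every version is a full copy
    return base_gb * versions / 1000

def lakefs_history_tb(base_gb: float, days: int,
                      daily_change_frac: float, metadata_gb: float) -> float:
    # One base copy, metadata overhead, plus only the changed objects per day
    changed = base_gb * daily_change_frac * days
    return (base_gb + metadata_gb + changed) / 1000

print(dvc_full_history_tb(500, 30))          # 15.0 TB
print(lakefs_history_tb(500, 30, 0.10, 25))  # ~2.0 TB
```

Swap in your own dataset size and churn rate; the crossover depends far more on the change fraction than on the base size.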
Real-World Cost Analysis
To ground this further, here's what typical organizations pay:
Small team (5 data scientists, <1TB data):
- DVC: ~$50/month S3 + negligible DVC overhead
- LakeFS: $200-300/month (server, metadata storage, monitoring)
- Winner: DVC (10x cheaper)
Medium team (20 engineers, 10-50TB):
- DVC: ~$500/month S3 + git storage, but sprawling version duplication = ~$2000/month effective cost
- LakeFS: ~$1500/month (single server, efficient storage, proper lineage)
- Winner: LakeFS (better value at scale)
Large enterprise (100+ teams, >1PB):
- DVC: Governance nightmare, storage costs explode, often $50k+/month
- LakeFS: Scales to petabytes, $5-10k/month, audit trail is priceless for compliance
- Winner: LakeFS (not even close)
The inflection point is typically around 20-30TB of data with regular branching workflows. Below that, DVC's simplicity wins. Above that, LakeFS's efficiency and governance features pay for themselves.
Practical Example: End-to-End Data Versioning
Let's walk through a real scenario: you're building a recommendation model, and you want to version both the training data and experiment branches safely.
Setup with DVC
# Initialize project
dvc init
dvc remote add -d s3remote s3://my-ml-bucket/dvc
# Add initial dataset
dvc add data/interactions.csv
git add data/interactions.csv.dvc
git commit -m "Initial interaction dataset"
# Create a feature engineering pipeline
cat > dvc.yaml << 'EOF'
stages:
  build_features:
    cmd: python scripts/build_features.py
    deps:
      - data/interactions.csv
      - scripts/build_features.py
    outs:
      - data/features.parquet
  train:
    cmd: python scripts/train.py
    deps:
      - data/features.parquet
    params:
      - train.lr
      - train.batch_size
    outs:
      - models/v1.pkl
    metrics:
      - metrics/train.json:
          cache: false
  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - models/v1.pkl
      - data/test.csv
    metrics:
      - metrics/eval.json:
          cache: false
EOF
# Push all data to S3
dvc push
# Tag this as v1.0
git tag -a v1.0 -m "First production model"

Now, a teammate wants to experiment with a new data cleaning approach:
# Create experiment branch
git checkout -b feature/better-cleaning
# Update the raw data
dvc add data/interactions.csv
# Run the pipeline
dvc repro
# See what changed
dvc metrics diff main
# Push experiment data
dvc push
git add data/interactions.csv.dvc metrics/
git commit -m "Test new data cleaning"
git push origin feature/better-cleaning

Setup with LakeFS
The same workflow, but with LakeFS's branching:
# Create a LakeFS repo connected to S3
import lakefs_sdk

configuration = lakefs_sdk.Configuration(
    host="http://localhost:8000/api/v1",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)
api = lakefs_sdk.ApiClient(configuration)

# Create the repository
repos_api = lakefs_sdk.RepositoriesApi(api)
repos_api.create_repository(
    repository_creation=lakefs_sdk.RepositoryCreation(
        name="ml-data",
        storage_namespace="s3://my-ml-bucket/lakefs-data",
    )
)

# Upload initial dataset to main
objects_api = lakefs_sdk.ObjectsApi(api)
with open('data/interactions.csv', 'rb') as f:
    objects_api.upload_object(
        repository="ml-data",
        branch="main",
        path="interactions.csv",
        content=f,
    )

# Create a branch for experiments
branches_api = lakefs_sdk.BranchesApi(api)
branches_api.create_branch(
    repository="ml-data",
    branch_creation=lakefs_sdk.BranchCreation(
        name="feature/better-cleaning",
        source="main",
    )
)

# Work on the experiment branch
# Upload new data, transform it
with open('data/interactions_cleaned.csv', 'rb') as f:
    objects_api.upload_object(
        repository="ml-data",
        branch="feature/better-cleaning",
        path="interactions.csv",
        content=f,
    )

# Commit the changes
commits_api = lakefs_sdk.CommitsApi(api)
commits_api.commit(
    repository="ml-data",
    branch="feature/better-cleaning",
    commit_creation=lakefs_sdk.CommitCreation(
        message="Apply improved data cleaning"
    )
)

# Once validated, merge back to main
refs_api = lakefs_sdk.RefsApi(api)
refs_api.merge_into_branch(
    repository="ml-data",
    source_ref="feature/better-cleaning",
    destination_branch="main",
    merge=lakefs_sdk.Merge(message="Merge improved cleaning to production"),
)

Notice the difference: with DVC, you're managing Git branches and using DVC as a tool within those branches. With LakeFS, the data lake itself has branches - they're first-class citizens.
Choosing Between DVC and LakeFS
Here's how to decide:
Use DVC if:
- Your team is small or data science-focused
- You want minimal infrastructure overhead
- Git workflow is already your standard
- You're primarily versioning training datasets for models
- Your data is relatively small (<10TB)
Use LakeFS if:
- You're managing a large, shared data lake
- Multiple teams need isolated data branches simultaneously
- You need strong audit trails and compliance requirements
- Storage efficiency matters (copy-on-write saves cost at scale)
- You're building data products with complex branching workflows
Use both together if:
- You have a data lake (LakeFS) that feeds multiple ML pipelines (DVC)
- LakeFS manages the raw data and experiment branches; DVC versioning sits downstream in model training
- You want maximum flexibility and audit trail coverage
Integration Patterns
In practice, many organizations combine them:
One underappreciated aspect of dual-system integration is the question of synchronization. When you're running both DVC and LakeFS, they maintain independent state. DVC tracks data versions through git commits and dvc.yaml pipelines. LakeFS tracks versions through its internal commit log. If a data scientist updates a dataset through LakeFS and forgets to trigger a corresponding DVC update, you end up with a mismatch. The DVC pipeline still points to the old version, but the actual data has changed. Months later, when someone tries to reproduce a result, they get different data than the original run used, and they don't realize why their model behaves differently.
The solution is establishing synchronization points where these systems connect. Many teams implement a pattern where LakeFS is the source of truth for raw data, and DVC is the source of truth for feature engineering and model training. When new data is committed to LakeFS main, it triggers a CI/CD pipeline that updates the corresponding DVC data references and reruns feature engineering. This ensures the two systems stay in sync. The orchestration overhead is worth it for organizations with many concurrent users, because it prevents the drift that inevitably occurs when systems are maintained independently.
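One lightweight way to implement that synchronization point is to pin the latest LakeFS commit ID into a small file that `dvc.yaml` lists as a dependency. The event shape and file name here are hypothetical; adapt them to your webhook payload:

```python
import json
from pathlib import Path

def record_lakefs_ref(event: dict,
                      ref_path: str = "data/lakefs_ref.json") -> bool:
    """Pin the latest LakeFS main commit where DVC can see it.

    Because dvc.yaml declares ref_path as a dep, the next `dvc repro`
    detects the change and reruns feature engineering downstream.
    """
    if event.get("branch") != "main":
        return False  # ignore experiment branches
    Path(ref_path).parent.mkdir(parents=True, exist_ok=True)
    Path(ref_path).write_text(json.dumps(
        {"repository": event["repository"], "commit_id": event["commit_id"]},
        indent=2,
    ))
    return True

assert record_lakefs_ref(
    {"branch": "main", "repository": "ml-data", "commit_id": "abc123"})
assert not record_lakefs_ref(
    {"branch": "exp", "repository": "ml-data", "commit_id": "def456"})
```

The handler returns a boolean so the CI wrapper knows whether to kick off `dvc repro` or do nothing.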
Another integration consideration is the question of backward compatibility. If you're already using DVC and your data is checked into git (perhaps in a Git LFS setup), migrating to LakeFS is not a trivial operation. You need to copy all existing versions to LakeFS, establish the branch history, and update all downstream references. Teams that have attempted this migration report it taking weeks, not days. The lesson is choosing your data versioning strategy early, because switching is expensive.
Raw Data Lake handled by LakeFS feeds into a Feature Store tracked by DVC, which then flows into Model Training orchestrated through DVC pipelines, ultimately connecting to a Model Registry.
LakeFS isolates raw data changes; DVC ensures reproducibility throughout the feature engineering and training pipeline.
The deeper truth about integrating these tools is understanding where each tool's strengths shine brightest in the larger ecosystem. When you're building a mature ML platform, you realize that data versioning isn't a single problem - it's actually multiple nested problems that require different solutions at different layers. At the raw data layer, you're dealing with massive volumes of potentially chaotic data ingestion from multiple sources. This is where LakeFS excels because it provides the branching and isolation semantics that let different teams work on data exploration and cleaning without stepping on each other's toes. You can create an experimental branch, run aggressive transformations, and only merge back to main when you're confident in the quality. The copy-on-write mechanism means you're not duplicating terabytes of data during this process - you're just creating metadata pointers to changed objects.
Once raw data is validated and cleaned, it flows into feature engineering. This is where DVC takes over as the natural tool. Feature pipelines are deterministic transformations - you have inputs (raw data), you have code (feature engineering scripts), you have outputs (computed features). DVC tracks all three components in a declarative pipeline definition. When a feature engineer changes a feature computation, DVC automatically detects which downstream models need retraining. When a model trains on features computed from specific raw data, DVC preserves that lineage. Six months later, when someone asks "where did this feature come from?", you have the complete chain of custody from raw data through transformations to final features.
The integration point is critical. When LakeFS exposes a data lake branch through a standard S3 interface, a DVC pipeline can treat it as just another data source. The pipeline says "my input is s3://mylake/main/features/training.parquet" and DVC handles all the versioning bookkeeping. If that S3 path changes (you switch from main branch to an experimental branch), DVC detects the change and reruns dependent stages. This creates a unified versioning story across the entire system.
Common Integration Mistakes
When combining tools, teams often hit predictable pitfalls. Understanding these patterns helps you avoid expensive mistakes.
Mistake 1: Duplicating lineage tracking often emerges when teams implement both systems without clear boundaries. You end up with DVC tracking data in dvc.yaml files with commit hashes, while LakeFS tracks with its own commit IDs and metadata. Neither system talks to the other. Fast forward three months and you're in meetings where one person is referencing "LakeFS commit 5f3a2b1d" while another references "DVC tag v2.3.1". Are they the same data version? Nobody knows. The solution is establishing a single source of truth for lineage. Usually, LakeFS becomes the authority for raw data lineage - which commits changed what, who approved them, when they happened. DVC layers on top of that, tracking which specific LakeFS commits fed into which feature computations and model training runs. You don't duplicate the tracking; you stack it intentionally. Cost of error: many hours of detective work trying to match versions later, or worse, discovering that two supposedly identical training runs actually used slightly different data.
Mistake 2: Inconsistent metadata happens when you tag data differently across systems. DVC pipelines tag features as "v1.0" based on when the feature code changed, but LakeFS commits are tagged with timestamps and user names. When you look at a trained model, you see "trained on features-v1.0" but not which specific LakeFS commits those features came from. If someone modified upstream data and re-published it with the same tag, you're now training on different data than the original run, unaware of the change. Solution: establish canonical naming conventions. If a data version is "2024-q1-prod", both systems reference that identifier. Tags flow through both LakeFS and DVC. When you query "what data fed into model-v3.2", the answer includes complete version information for every system in the chain. Cost of error: spending hours trying to correlate metadata across systems, or discovering inconsistencies after models have diverged.
Mistake 3: Storage sprawl occurs when you keep versions forever in both systems. DVC versioning means every copy of a dataset gets stored on S3. LakeFS metadata layer means you're accumulating commit history. After a year of active development, you might be paying for storage of versions nobody will ever use again - old experimental branches in DVC, superseded data versions in LakeFS. The solution is defining explicit retention policies. "We keep all production data versions forever, experimental branches for 3 months, feature snapshots for 6 months." Configure automatic deletion policies. For DVC, use garbage collection to prune unreachable versions. For LakeFS, set metadata retention limits. Document these policies so team members understand the tradeoffs. Cost of error: storage bills growing 30-50% due to accumulated old versions, or worse, discovering that you've been paying for storage of data that was already deleted due to misconfigured policies.
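A retention policy like the one quoted above can be encoded directly and run before garbage collection. A sketch using the example cutoffs, where the prefix-based tag naming is an assumption:

```python
from datetime import datetime, timedelta

def keep_version(tag: str, created: datetime, now: datetime) -> bool:
    """Production versions forever, experiments 90 days, snapshots 180 days."""
    age = now - created
    if tag.startswith("prod"):
        return True
    if tag.startswith("exp"):
        return age <= timedelta(days=90)
    return age <= timedelta(days=180)  # feature snapshots and everything else

now = datetime(2024, 6, 1)
assert keep_version("prod-2023-q1", datetime(2023, 1, 1), now)      # forever
assert not keep_version("exp-cleaning", datetime(2024, 1, 1), now)  # >90 days
assert keep_version("features-2024-02", datetime(2024, 2, 1), now)  # <180 days
```

Versions that fail the check become candidates for `dvc gc` or LakeFS retention cleanup, and the function doubles as documentation of the policy itself.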
Mistake 4: Slow data pulls manifests when LakeFS operations add latency to otherwise fast DVC operations. When DVC tries to pull features from S3, it's fast. When DVC tries to pull from an S3 path that's actually LakeFS metadata-routed to a deep storage location with complex access patterns, the latency compounds. You discover your data pipelines have doubled in execution time for reasons that aren't obvious. The solution is ensuring S3 is truly the backend for both systems. Don't layer storage abstraction on top of abstraction. Configure LakeFS to use S3-compatible storage directly. Ensure DVC points to the same S3 bucket. When network paths are simple and direct, latency stays acceptable. Cost of error: CI/CD pipelines become noticeably slower, reducing iteration speed. If your model retraining pipeline goes from 30 minutes to 60 minutes due to data access latency, you've just made your iteration cycle half as fast.
The Cost of Not Versioning: A Cautionary Tale
The true value of data versioning becomes apparent only when you're trying to recover from its absence. Consider a realistic scenario: your team trains a model in January that achieves 92 percent accuracy on your validation set. In March, you deploy it to production. By June, accuracy has drifted to 89 percent. The model is still working, but you're losing money on every misprediction. You investigate and discover that the training data changed in February - someone updated the data pipeline, and the distribution shifted. Now you face a choice: retrain on current data (which might require rewriting feature engineering code if you didn't version it), or try to figure out what the original data was and understand what changed.
If you had data versioning in place, the investigation takes minutes. You pull the exact dataset version the January model trained on, retrain on it, and compare against your current model. The performance gap tells you whether the issue is data drift or model degradation. If it's data drift, you know exactly which records changed and can investigate why. If you don't have data versioning, you're guessing. You might spend weeks investigating, only to discover the problem was something obvious that would have been visible in a simple diff. The cost of not versioning isn't a one-time loss - it's repeated pain across your entire organization as models degrade and you can't diagnose why.
Organizations that mature beyond the "chaos engineering" phase of ML quickly discover that data versioning is cheaper than the alternative. The cost of implementing DVC or LakeFS is small. The cost of losing data history is enormous.
Wrapping Up
Data versioning is the unsexy infrastructure that separates teams that can reproduce results from those that can't. DVC and LakeFS solve the problem with different philosophies:
- DVC brings Git to your data - lightweight, familiar, perfect for data scientists
- LakeFS brings Git semantics to your storage - powerful branching, copy-on-write efficiency, ideal for data lakes
Both are mature, production-ready tools. Both integrate into modern ML workflows. The choice depends on your scale, your team's structure, and how deep data versioning runs in your infrastructure.
Start with DVC if you're just beginning. Layer in LakeFS when you're managing a shared data lake across multiple teams. And remember: the goal isn't the tool. It's the ability to say, "I can reproduce this result," with absolute confidence.