December 25, 2025
AI/ML Infrastructure Platform

Building an Internal ML Platform: Self-Service Infrastructure

You've got models in production. Maybe too many. And your data scientists are spending more time wrestling with infrastructure than actually building models. Sound familiar?

This is where an internal ML platform saves the day. We're talking about a self-service infrastructure that lets your team move fast without breaking things - or your operational budget.

Table of Contents
  1. The Problem We're Solving
  2. Why Most Platform Projects Fail
  3. The Human Element: Why Great Platforms Get Rejected
  4. The Real Cost of Platform Delays
  5. Architecture: Three Critical Abstraction Layers
  6. Layer 1: Compute Abstraction (Kubernetes + Argo Workflows)
  7. Layer 2: Storage Abstraction (S3-Compatible Artifact Store)
  8. Layer 3: Experiment Tracking Abstraction (MLflow API)
  9. Architecture Diagram
  10. Developer Experience: The Key to Platform Adoption
  11. Single CLI for Job Submission
  12. Web UI for Experiment Browsing
  13. Self-Service GPU Quota Management
  14. Notebook-to-Pipeline Promotion Workflow
  15. Backstage Integration: ML Models as Managed Services
  16. Software Catalog Entries for ML Models
  17. Tech Docs for Model Documentation
  18. Overview
  19. Performance
  20. Input Format
  21. Output Format
  22. Training Data
  23. Dependencies
  24. Deployment History
  25. Custom Scaffolder Templates for ML Project Bootstrap
  26. Multi-Tenancy and Fair Resource Sharing
  27. Kubernetes ResourceQuota Per Team Namespace
  28. GPU Quota Management with LimitRange
  29. Fair-Share Scheduling via KAI Scheduler Priority Queues
  30. Understanding Platform Adoption Patterns
  31. Platform Evolution Strategy
  32. Starting Point: Build on Kubeflow or MLflow Foundation
  33. Build Thin Abstractions Over Time
  34. Avoid Lock-In Through Adapter Patterns
  35. Measure Platform Adoption and Impact
  36. Complete Example: End-to-End ML Workflow
  37. Step 1: Bootstrap a New Project
  38. Step 2: Train Locally, Then Scale
  39. Step 3: Submit to Platform for Large-Scale Training
  40. Step 4: Review Results in Web UI
  41. Step 5: Deploy to Production
  42. The Impact
  43. Case Study: From Chaos to Velocity in Nine Months
  44. Breaking Down the Real-World Benefits
  45. The Platform's Role in Model Reliability
  46. The Hidden Cost of Not Having a Platform
  47. Why Platforms Matter Now More Than Ever
  48. Integration with Data Quality and Compliance
  49. Common Platform Anti-Patterns to Avoid
  50. Long-Term Sustainability and Cost Management
  51. Building Organizational Competency
  52. Next Steps
  53. Measuring Success
  54. The Compounding Value of Platform Investment
  55. Technical Debt and Platform Evolution
  56. Building Buy-In Across Your Organization
  57. The Long Game: Platform as Competitive Moat

The Problem We're Solving

Here's the reality: without a platform, your ML workflow looks messy. Data scientists provision their own cloud resources. Engineers manually ship models to production. Notebooks somehow become production systems (please don't let this happen). Nobody really knows what's running where, or what it costs.

An internal ML platform centralizes this chaos. It abstracts away infrastructure complexity while maintaining operator control. Think of it as a Kubernetes-native PaaS, but built specifically for ML workloads.

The key insight? Don't build everything from scratch. Instead, layer abstractions on top of existing tools like Kubernetes, MLflow, and Argo. Your developers get a clean interface. You avoid vendor lock-in. Win-win.

Why Most Platform Projects Fail

Before we design a good platform, let's look at why most fail. It's usually for one of three reasons:

First, too much abstraction upfront. Teams build a perfect platform that handles every edge case, every workflow, every tool. By the time it's ready, nobody wants to use it because it's overkill for simple tasks. Lesson: start with core workflows, not all possible workflows.

Second, the adoption problem. You build a great platform, but adoption is terrible because it requires rewriting all existing code. Or because your team doesn't trust it yet. Lesson: make it trivially easy to use, provide migration paths, not rewrites.

Third, the complexity problem. Your platform becomes so complex that operating it requires dedicated platform engineers. Now you've created another bottleneck. Lesson: simplicity is a feature. Use boring, battle-tested tools.

The Human Element: Why Great Platforms Get Rejected

Here's something we rarely talk about: platform adoption fails because of people, not technology. Your team has been training models their way for years. They know how to debug their notebooks. They understand their local infrastructure quirks. When you ask them to switch to a platform, you're asking them to relearn everything, trust new tooling, and accept that they're now dependent on a platform team for their productivity.

This creates two failure modes. First, there's the conscious rejection: your team actively avoids using the platform because it's unfamiliar or slower for their specific workflow. They build workarounds. Six months in, you've got two systems running in parallel - the platform and the ad-hoc infrastructure teams built anyway. Now you're maintaining both.

Second, there's the quiet failure: your team tries the platform, hits a limitation, gets stuck, and goes back to their old way. But they don't tell you. You think adoption is happening. Three months later, you realize nobody's actually using it.

The fix? Start with your early adopters. Find the team that's actively struggling with infrastructure. Show them how the platform solves their specific problem. Measure the win. Then expand from there with proof, not mandates.

The Real Cost of Platform Delays

When you delay launching your platform while trying to make it perfect, you're paying a hidden cost. Every month your team spends wrestling with infrastructure is a month they're not building models. On a team of ten data scientists, each person spending 20% of their time on infrastructure issues adds up to two full-time engineers' worth of lost productivity. If you can recover that 20% with a platform, that's 20% more models, 20% faster iteration, 20% more experimentation.

But this isn't just about lost productivity - it's about momentum and team morale. When scientists are fighting with infrastructure, they get discouraged. They spend energy on frustration instead of creativity. A good platform changes that dynamic entirely. Suddenly, infrastructure is invisible. They submit jobs and get results. The feedback loop tightens. Iteration accelerates. And that acceleration compounds over time.

Think about the difference between a team that can run 50 experiments per week and a team that can run 150. Over a year, that's 5,200 additional experiments. Some percentage will lead to better models, better insights, better products. The platform cost pays for itself many times over.

Architecture: Three Critical Abstraction Layers

Let's break down the architecture into three interconnected layers that handle compute, storage, and experimentation tracking.

Layer 1: Compute Abstraction (Kubernetes + Argo Workflows)

Your compute abstraction sits on top of Kubernetes. Why Kubernetes? It's the industry standard, it scales, and it handles multi-tenancy natively. But raw Kubernetes YAML is not a friendly interface for data scientists.

Enter Argo Workflows. This is your orchestration layer. Instead of kubectl apply yaml files, scientists submit training jobs through a simple CLI or web UI. Argo handles scheduling, retries, and resource management.

Here's what this looks like in practice:

python
# iNet Platform Compute Client
import time
from inet_platform import ComputeClient
 
client = ComputeClient(namespace="data-team")
 
# Submit a training job with auto-scaling
job = client.submit_training(
    name="bert-finetuning-v2",
    image="registry.internal/ml-training:latest",
    command=["python", "train.py"],
    cpu_request="4",
    memory_request="16Gi",
    gpu_request="1",  # Automatic GPU quota management
    timeout_hours=24,
    output_artifacts={
        "model": "/workspace/checkpoints/final_model.pt",
        "metrics": "/workspace/metrics.json"
    }
)
 
# Poll for completion
while not job.is_complete():
    print(f"Status: {job.status}")
    time.sleep(10)
 
print(f"Model saved to: {job.artifacts['model']}")

Under the hood, this client generates an Argo Workflow with proper resource requests, creates the Kubernetes job, and monitors it:

yaml
# Generated Argo Workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: data-team
  name: bert-finetuning-v2
spec:
  serviceAccountName: data-team-sa
  entrypoint: training
  templates:
    - name: training
      container:
        image: registry.internal/ml-training:latest
        command: ["python", "train.py"]
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
      outputs:
        artifacts:
          - name: model
            path: /workspace/checkpoints/final_model.pt
            s3:
              bucket: ml-artifacts
              endpoint: minio.internal
              key: bert-v2/model.pt

What's happening here? The platform abstracts away YAML complexity. Your team submits jobs through Python. Kubernetes handles the actual scheduling and resource isolation.
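For a sense of how thin this layer can be, here's an illustrative sketch of the translation step: a pure function mapping the `submit_training()` arguments onto the manifest structure above. The `build_workflow_manifest` name and exact fields are assumptions for illustration, not the real client's internals.

```python
# Illustrative translation step: submit_training() arguments in, Argo
# Workflow manifest (as a plain dict, ready for YAML serialization) out.

def build_workflow_manifest(name, namespace, image, command,
                            cpu, memory, gpu, artifacts):
    """Map job parameters onto the Workflow structure shown above."""
    resources = {"cpu": cpu, "memory": memory, "nvidia.com/gpu": str(gpu)}
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entrypoint": "training",
            "templates": [{
                "name": "training",
                "container": {
                    "image": image,
                    "command": command,
                    # requests == limits gives pods the Guaranteed QoS class
                    "resources": {"requests": resources, "limits": resources},
                },
                "outputs": {
                    "artifacts": [
                        {"name": art, "path": path}
                        for art, path in artifacts.items()
                    ]
                },
            }],
        },
    }

manifest = build_workflow_manifest(
    name="bert-finetuning-v2", namespace="data-team",
    image="registry.internal/ml-training:latest",
    command=["python", "train.py"],
    cpu="4", memory="16Gi", gpu=1,
    artifacts={"model": "/workspace/checkpoints/final_model.pt"},
)
```

The real client would layer auth, retries, and labels on top, but the core is just this mapping.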

Layer 2: Storage Abstraction (S3-Compatible Artifact Store)

Models, datasets, and intermediate training artifacts need to live somewhere. S3 is the standard. But you don't necessarily want to run on AWS - many organizations use MinIO (S3-compatible, runs on-premise) or Ceph.

Your abstraction layer provides a unified storage interface:

python
# Unified artifact storage (works with S3, MinIO, Ceph)
from inet_platform import ArtifactStore
 
store = ArtifactStore(backend="minio")
 
# Save training artifacts
store.save_artifact(
    path="bert-finetuning/v2/model.pt",
    local_file="/workspace/checkpoints/final_model.pt",
    metadata={
        "model_type": "BERT",
        "framework": "pytorch",
        "dataset": "wikitext-103",
        "hyperparams": {
            "lr": 1e-5,
            "batch_size": 32,
            "epochs": 3
        }
    }
)
 
# List artifacts in a namespace
artifacts = store.list_artifacts(prefix="bert-finetuning/")
for artifact in artifacts:
    print(f"{artifact.path} - {artifact.size_mb}MB - {artifact.created}")
 
# Output:
# bert-finetuning/v1/model.pt - 438MB - 2026-02-20 14:32:10
# bert-finetuning/v2/model.pt - 445MB - 2026-02-27 09:15:33
# bert-finetuning/v2/training_log.json - 2MB - 2026-02-27 09:18:22

Why an abstraction? Future-proofing. If you start with MinIO and later want to migrate to AWS S3, you change one line of configuration. Your code stays the same.
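A minimal sketch of what that one-line swap can look like: a factory keyed on the configured backend. The stub classes here stand in for the real adapters; all names are illustrative.

```python
# Stub backends standing in for the real adapters; the point is that callers
# only ever touch the factory, so swapping MinIO for S3 is a config change.

class S3Backend:
    def __init__(self, endpoint=None):
        self.endpoint = endpoint or "s3.amazonaws.com"

class MinIOBackend:
    def __init__(self, endpoint=None):
        self.endpoint = endpoint or "minio.internal"

_BACKENDS = {"s3": S3Backend, "minio": MinIOBackend}

def make_store(config):
    """Instantiate whichever backend the config names."""
    return _BACKENDS[config["backend"]](config.get("endpoint"))

# Today:
store = make_store({"backend": "minio"})
# After migrating - only this dict changes:
# store = make_store({"backend": "s3"})
```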

Layer 3: Experiment Tracking Abstraction (MLflow API)

MLflow is the de-facto standard for tracking experiments. But you want to run it inside your infrastructure, not rely on cloud-hosted solutions.

Your platform provides MLflow as a managed service:

python
# Experiment tracking via platform-managed MLflow
from inet_platform import ExperimentTracker
import mlflow
 
tracker = ExperimentTracker(workspace="bert-experiments")
 
with mlflow.start_run(run_name="bert-v2-16bit") as run:
    # Log hyperparameters
    mlflow.log_params({
        "learning_rate": 1e-5,
        "batch_size": 32,
        "precision": "float16",
        "optimizer": "AdamW"
    })
 
    # Simulate training with metrics logging
    for epoch in range(3):
        train_loss = 2.1 - (epoch * 0.3)  # Simulated improvement
        val_loss = 2.0 - (epoch * 0.25)
 
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "epoch": epoch
        })
        print(f"Epoch {epoch}: train={train_loss:.3f}, val={val_loss:.3f}")
 
    # Log model
    mlflow.pytorch.log_model(
        pytorch_model=None,  # pass your trained nn.Module in a real run
        artifact_path="model",
        code_paths=["train.py"]
    )
 
    # Log tags for filtering
    mlflow.set_tags({
        "team": "nlp",
        "production_candidate": True,
        "git_sha": "a3f8d2c"
    })
 
print(f"Run ID: {run.info.run_id}")
# Output: Run ID: e8c3f9d2-a4b1-4e2f-9c7e-3d5f8a1b2c4d
 
# Query experiments
experiments = tracker.get_experiments(tags={"production_candidate": True})
for exp in experiments:
    print(f"{exp.name}: {exp.metrics['val_loss']:.3f}")

The platform stores all of this in a central MLflow instance. Your team can browse experiments through a web UI or query them programmatically.

Architecture Diagram

Here's how it all fits together:

mermaid
graph TB
    User["👤 Data Scientist"]
    CLI["CLI Client<br/>(inet platform submit)"]
    WebUI["Web UI<br/>(Job Browser)"]
 
    User -->|submits job| CLI
    User -->|views experiments| WebUI
 
    CLI -->|creates workflow| APIServer["Platform API<br/>(FastAPI)"]
    WebUI -->|queries jobs| APIServer
 
    APIServer -->|generates YAML| K8S["Kubernetes<br/>(control plane)"]
    APIServer -->|stores metadata| DB["PostgreSQL<br/>(Job metadata)"]
 
    K8S -->|executes| ArgoWF["Argo Workflows<br/>(orchestration)"]
    ArgoWF -->|spawns pod| GPU["GPU Node<br/>(nvidia/gpu)"]
 
    GPU -->|writes artifacts| S3["MinIO/S3<br/>(artifact store)"]
    GPU -->|logs metrics| MLflow["MLflow Server<br/>(experiment tracking)"]
 
    style User fill:#e1f5ff
    style CLI fill:#f3e5f5
    style WebUI fill:#f3e5f5
    style K8S fill:#fff3e0
    style ArgoWF fill:#fff3e0
    style GPU fill:#e8f5e9
    style S3 fill:#fce4ec
    style MLflow fill:#fce4ec

Developer Experience: The Key to Platform Adoption

A good platform is invisible. Developers don't think about it - they just submit jobs and get results.

Single CLI for Job Submission

Your team doesn't need to know about Kubernetes, Argo, or ConfigMaps. They use one command:

bash
# Simple job submission
inet platform train \
  --name "bert-finetuning" \
  --image "ml-training:latest" \
  --script "train.py" \
  --gpu 1 \
  --memory "16Gi" \
  --output-artifacts "model.pt"
 
# Output:
# Job submitted: bert-finetuning-abc123
# Status: Pending
# GPU quota: 1/4 (25% of team allocation)
#
# Monitor with:
#   inet platform logs bert-finetuning-abc123
#   inet platform status bert-finetuning-abc123

Web UI for Experiment Browsing

Your team accesses a Grafana-like dashboard to browse all experiments, compare metrics, and filter by tags:

┌─────────────────────────────────────────┐
│ ML Platform Experiments Dashboard       │
├─────────────────────────────────────────┤
│                                         │
│ Filter: [Team: NLP ▼] [Status: done ▼] │
│                                         │
│ Name              | Val Loss | GPU Time │
│─────────────────────────────────────────│
│ bert-v2-16bit     | 1.42     | 12h      │
│ bert-v2-32bit     | 1.38     | 18h      │
│ gpt-finetune-v1   | 2.01     | 24h      │
│─────────────────────────────────────────│
│                                         │
│ [Compare selected] [Export CSV]        │
└─────────────────────────────────────────┘

This UI pulls data from MLflow and your job metadata store.
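As a sketch of that data layer, the join can be as simple as merging run metrics with job records by name. In the real service, `runs` would come from the MLflow REST API and `jobs` from the Postgres job store; the function name is hypothetical.

```python
# Illustrative data layer for the dashboard: join MLflow run metrics with
# job records (GPU time) by run name, sorted best-first by validation loss.

def build_dashboard_rows(runs, jobs):
    """Merge experiment metrics and job metadata into display rows."""
    gpu_hours = {job["name"]: job["gpu_hours"] for job in jobs}
    rows = [
        {
            "name": run["name"],
            "val_loss": run["metrics"]["val_loss"],
            "gpu_time": f'{gpu_hours.get(run["name"], 0)}h',
        }
        for run in runs
    ]
    return sorted(rows, key=lambda r: r["val_loss"])  # lowest loss first

runs = [
    {"name": "bert-v2-16bit", "metrics": {"val_loss": 1.42}},
    {"name": "bert-v2-32bit", "metrics": {"val_loss": 1.38}},
]
jobs = [
    {"name": "bert-v2-16bit", "gpu_hours": 12},
    {"name": "bert-v2-32bit", "gpu_hours": 18},
]
rows = build_dashboard_rows(runs, jobs)
```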

Self-Service GPU Quota Management

Teams get a fixed GPU budget. They can see their usage and request more through the platform:

python
from inet_platform import QuotaManager
 
quota_mgr = QuotaManager()
 
# Check current quota
usage = quota_mgr.get_team_quota(team="nlp-team")
print(f"GPU allocation: {usage.gpu_allocated}/4")
print(f"GPU in-use: {usage.gpu_in_use}")
print(f"Available: {usage.gpu_available}")
 
# Output:
# GPU allocation: 4/4
# GPU in-use: 2
# Available: 2
 
# Request additional quota (triggers approval workflow)
request = quota_mgr.request_quota_increase(
    team="nlp-team",
    resource="gpu",
    current=4,
    requested=8,
    justification="Scaling bert-finetuning to larger batches",
    duration_weeks=4
)
print(f"Request {request.id} submitted for approval")

Notebook-to-Pipeline Promotion Workflow

Your data scientists develop in notebooks. The platform helps them graduate to production pipelines without rewriting code:

python
# In Jupyter notebook
from inet_platform import NotebookExporter
 
# ... train model in notebook ...
 
exporter = NotebookExporter()
 
# Export notebook as production pipeline
pipeline = exporter.to_pipeline(
    notebook_path="bert_training.ipynb",
    entry_point="train_model",  # Function name in notebook
    parameters={
        "learning_rate": 1e-5,
        "batch_size": 32,
        "dataset_path": "/data/wikitext"
    },
    dependencies=["transformers>=4.25", "torch>=2.0"],
    output_artifacts=["model.pt", "metrics.json"]
)
 
# This generates an Argo Workflow template and a CLI command
print(pipeline.cli_command)
# Output: inet platform train --config bert_pipeline.yaml

The exporter handles dependency extraction, cell reordering, and parameter injection. No rewriting needed.
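Dependency extraction, at its core, is a scan of the notebook's code cells for top-level imports. A simplified sketch, assuming a small alias table for packages whose import name differs from their pip name (the real exporter would do considerably more):

```python
# Simplified dependency extraction: find top-level imports in code cells and
# map module names to pip package names where they differ.
import re

# Tiny sample mapping; a real exporter would carry a fuller table
PACKAGE_ALIASES = {"sklearn": "scikit-learn", "cv2": "opencv-python"}

def extract_dependencies(cells):
    """Return the set of pip packages imported across notebook code cells."""
    pattern = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)
    packages = set()
    for cell in cells:
        for module in pattern.findall(cell):
            packages.add(PACKAGE_ALIASES.get(module, module))
    return packages

cells = ["import torch\nfrom transformers import AutoModel", "import sklearn"]
deps = extract_dependencies(cells)
# deps now contains 'torch', 'transformers', and 'scikit-learn'
```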

Backstage Integration: ML Models as Managed Services

Backstage is a platform abstraction layer for all your developer tools. You extend it to manage ML models as first-class entities.

Software Catalog Entries for ML Models

Every production model gets a catalog entry:

yaml
# Entity: ML Model
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: bert-sentiment
  namespace: ml-models
  title: BERT Sentiment Classifier
  description: Fine-tuned BERT for sentiment analysis
  annotations:
    inet/model-id: "bert-sentiment-v3"
    inet/framework: "pytorch"
    inet/inference-endpoint: "https://bert-api.internal"
    inet/mlflow-experiment: "sentiment-analysis"
spec:
  type: ml-model
  owner: nlp-team
  lifecycle: production
  providesApis:
    - bert-sentiment-api
  dependsOn:
    - resource:ml-artifact-store

Teams browse this catalog in Backstage. They see model lineage, past versions, and who owns it.

Tech Docs for Model Documentation

Backstage includes tech docs. You write Markdown documentation for models:

markdown
# BERT Sentiment Model
 
## Overview
 
Fine-tuned BERT for binary sentiment classification.
 
## Performance
 
- Accuracy: 94.2%
- F1 Score: 0.943
- Latency: 45ms (p95)
 
## Input Format
 
```json
{
  "text": "This product is amazing!",
  "max_length": 512
}
```

Output Format

json
{
  "label": "positive",
  "confidence": 0.98
}

Training Data

  • Dataset: IMDB reviews
  • Size: 50k examples
  • Split: 80/20 train/test

Dependencies

  • transformers: 4.25.1
  • torch: 2.0.0
  • pytorch-lightning: 2.0

Deployment History

  • v3: Feb 27, 2026 - 94.2% accuracy
  • v2: Feb 15, 2026 - 93.8% accuracy
  • v1: Jan 30, 2026 - 92.1% accuracy

This lives in your docs and is searchable from Backstage.

### Custom Scaffolder Templates for ML Project Bootstrap

When a new team starts an ML project, they use a Backstage scaffolder template to bootstrap it:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: ml-project-bootstrap
  title: ML Project Bootstrap
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Project Details
      properties:
        projectName:
          type: string
          description: Project name (e.g., 'bert-sentiment')
        modelType:
          type: string
          enum: ['nlp', 'vision', 'tabular', 'reinforcement']
        framework:
          type: string
          enum: ['pytorch', 'tensorflow', 'sklearn']
  steps:
    - id: create-repo
      name: Create Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.projectName }}
        template: ml-project-template-${{ parameters.framework }}
    - id: register-catalog
      name: Register Catalog Entry
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.create-repo.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

This scaffolder generates a complete project structure:

bert-sentiment/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   └── exploration.ipynb
├── src/
│   ├── train.py
│   ├── evaluate.py
│   └── inference.py
├── tests/
│   ├── test_train.py
│   └── test_inference.py
├── config/
│   ├── hyperparams.yaml
│   └── pipeline.yaml
├── Dockerfile
├── requirements.txt
├── catalog-info.yaml
└── README.md

Teams start with best practices built in.

Multi-Tenancy and Fair Resource Sharing

Multiple teams use your platform. They need isolation, fair quotas, and spending transparency.

Kubernetes ResourceQuota Per Team Namespace

Each team gets a namespace with hard resource limits:

yaml
# Namespace for nlp-team
apiVersion: v1
kind: Namespace
metadata:
  name: nlp-team
---
# ResourceQuota to enforce limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nlp-team-quota
  namespace: nlp-team
spec:
  hard:
    requests.cpu: "64"
    requests.memory: "512Gi"
    limits.cpu: "128"
    limits.memory: "1024Gi"
    pods: "1000"
    requests.storage: "2Ti"
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["normal", "batch"]
---
# NetworkPolicy to prevent team cross-communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nlp-team-isolation
  namespace: nlp-team
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: nlp-team
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: nlp-team
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 53 # DNS

GPU Quota Management with LimitRange

GPUs are expensive. You enforce per-team and per-job limits:

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: nlp-team
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "4" # Max 4 GPUs per job
      min:
        nvidia.com/gpu: "0"
      default:
        nvidia.com/gpu: "1"
      defaultRequest:
        nvidia.com/gpu: "1" # Must match the default limit (GPU request == limit)
---
# Pod disruption budget to prevent eviction during training
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-jobs-pdb
  namespace: nlp-team
spec:
  minAvailable: 1
  selector:
    matchLabels:
      job-type: training

Fair-Share Scheduling via KAI Scheduler Priority Queues

Kubernetes has a built-in scheduler, but it doesn't understand fairness for ML workloads. KAI Scheduler (or Volcano) provides priority queues:

yaml
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: bert-training-group
  namespace: nlp-team
spec:
  scheduleTimeoutSeconds: 86400
  minMember: 8 # Wait until all 8 pods can be scheduled together (gang scheduling for distributed training)
---
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: Queue
metadata:
  name: ml-workloads
spec:
  reclaimable: true
  weight: 100
  capability:
    cpu: 100
    memory: 1000Gi
---
# High-priority interactive jobs (notebooks)
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: Queue
metadata:
  name: interactive
spec:
  reclaimable: false
  weight: 200
  capability:
    cpu: 32
    memory: 256Gi

When the cluster fills up, interactive jobs (notebooks) get scheduled first. Batch training jobs are queued fairly.

Understanding Platform Adoption Patterns

Before we talk about evolution, let's understand why platforms succeed or fail. Most platform failures happen for predictable reasons:

The Complexity Trap: Teams build a perfect platform that handles every possible use case. By the time it's done, it's so complex that only platform engineers understand it. Scientists spend more time learning the platform than actually building models. Lesson: start simple.

The Migration Barrier: You build a wonderful platform, but every team needs to rewrite their entire codebase to use it. The cost of migration exceeds the benefit. Lesson: provide migration paths, not rewrites. Make it trivially easy to start using your platform for new work.

The Adoption Gap: You build a platform, but adoption is slow because teams don't trust it. They've had bad experiences with "platform" tools before. You need to build trust through small wins, not ambitious launches. Lesson: start with your most sympathetic user (usually your own team), deliver value quickly, then expand.

The Snowflake Problem: Your platform tries to support every framework, every workflow, every edge case. It becomes unmaintainable. Now you need dedicated platform engineers just to keep the lights on. Lesson: focus on core workflows. Let edge cases be handled by power users directly.

Platform Evolution Strategy

You don't build a platform in one sprint. You evolve it.

Starting Point: Build on Kubeflow or MLflow Foundation

Don't reinvent the wheel. Start with one of:

  • Kubeflow - full pipeline orchestration, notebooks, and model serving on Kubernetes
  • MLflow - experiment tracking and a model registry you can wrap incrementally

Either is a solid foundation. The key is choosing one aligned with your team's expertise.

python
# Using MLflow as foundation
from mlflow.deployments import get_deploy_client
from mlflow import MlflowClient
 
# List all available models
client = MlflowClient()
models = client.search_registered_models()
 
for model in models:
    print(f"Model: {model.name}")
    for version in model.latest_versions:
        print(f"  Version {version.version}: {version.current_stage}")

Build Thin Abstractions Over Time

Don't wrap everything immediately. Identify pain points, then build abstractions:

  1. Week 1-2: Teams submit Argo Workflows manually
  2. Week 3-4: Too much YAML? Build the CLI wrapper
  3. Month 2: GPU quota disputes? Implement quota management
  4. Month 3: Scientists want to browse experiments? Build the web UI

Each abstraction solves a real problem.
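The week-3 CLI wrapper, for instance, can start as little more than flag translation: turn friendly options into the `argo submit` invocation teams were typing by hand. A hedged sketch (the flag names and shared template file are assumptions):

```python
# Hypothetical CLI wrapper: translate friendly flags into an `argo submit`
# command line. Building the command instead of executing it keeps the
# sketch side-effect free; the real CLI would shell out via subprocess.
import argparse

def build_argo_command(argv):
    parser = argparse.ArgumentParser(prog="inet-platform-train")
    parser.add_argument("--name", required=True)
    parser.add_argument("--image", required=True)
    parser.add_argument("--gpu", type=int, default=0)
    parser.add_argument("--namespace", default="default")
    args = parser.parse_args(argv)
    return [
        "argo", "submit", "--namespace", args.namespace,
        "--name", args.name,
        "-p", f"image={args.image}",   # workflow template parameters
        "-p", f"gpu={args.gpu}",
        "training-template.yaml",      # assumed shared workflow template
    ]

cmd = build_argo_command(
    ["--name", "bert-finetuning", "--image", "ml-training:latest", "--gpu", "1"]
)
print(" ".join(cmd))
```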

Avoid Lock-In Through Adapter Patterns

Use the adapter pattern so switching backends is easy:

python
# Abstract storage interface
from abc import ABC, abstractmethod
from typing import List

import boto3
 
class ArtifactStore(ABC):
    @abstractmethod
    def save(self, path: str, local_file: str) -> str:
        pass
 
    @abstractmethod
    def load(self, path: str, local_file: str) -> None:
        pass
 
    @abstractmethod
    def list(self, prefix: str) -> List[str]:
        pass
 
# S3 implementation
class S3ArtifactStore(ArtifactStore):
    def __init__(self, bucket: str, endpoint: str):
        self.s3_client = boto3.client('s3', endpoint_url=endpoint)
        self.bucket = bucket
 
    def save(self, path: str, local_file: str) -> str:
        self.s3_client.upload_file(local_file, self.bucket, path)
        return f"s3://{self.bucket}/{path}"
 
    def load(self, path: str, local_file: str) -> None:
        self.s3_client.download_file(self.bucket, path, local_file)
 
    def list(self, prefix: str) -> List[str]:
        response = self.s3_client.list_objects_v2(
            Bucket=self.bucket,
            Prefix=prefix
        )
        return [obj['Key'] for obj in response.get('Contents', [])]
 
# MinIO implementation (same S3 API, different default endpoint)
class MinIOArtifactStore(S3ArtifactStore):
    def __init__(self, bucket: str, endpoint: str = "http://minio.internal:9000"):
        super().__init__(bucket, endpoint)
 
# Usage (no changes needed)
store = S3ArtifactStore(bucket="ml-artifacts", endpoint="s3.amazonaws.com")
store.save("bert-v2/model.pt", "/workspace/model.pt")

The rest of your code doesn't care which backend you use.

Measure Platform Adoption and Impact

Success isn't just about features. Measure:

python
# Platform metrics
class PlatformMetrics:
    def __init__(self, db_connection):
        self.db = db_connection
 
    def get_adoption_metrics(self, period_days: int = 30):
        """Measure platform adoption"""
        return {
            "total_jobs_submitted": self.db.query(
                "SELECT COUNT(*) FROM jobs WHERE created > now() - interval '{} days'".format(period_days)
            ),
            "unique_users": self.db.query(
                "SELECT COUNT(DISTINCT user_id) FROM jobs WHERE created > now() - interval '{} days'".format(period_days)
            ),
            "total_gpu_hours": self.db.query(
                "SELECT SUM(duration_minutes * gpu_count / 60) FROM jobs WHERE created > now() - interval '{} days'".format(period_days)
            ),
            "avg_job_runtime_minutes": self.db.query(
                "SELECT AVG(duration_minutes) FROM jobs WHERE created > now() - interval '{} days'".format(period_days)
            ),
            "time_to_first_training_job": self.db.query(
                "SELECT AVG(days_to_first_submission) FROM user_onboarding_metrics"
            ),
        }
 
metrics = PlatformMetrics(db)
adoption = metrics.get_adoption_metrics(period_days=30)
 
print("""
Platform Metrics (Last 30 Days)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Jobs Submitted:        {total_jobs_submitted}
Unique Users:          {unique_users}
Total GPU Hours:       {total_gpu_hours:,.0f}
Avg Job Runtime:       {avg_job_runtime_minutes:.1f} min
Time to First Job:     {time_to_first_training_job:.1f} days
""".format(**adoption))
 
# Output:
# Platform Metrics (Last 30 Days)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Jobs Submitted:        1247
# Unique Users:          42
# Total GPU Hours:       8493
# Avg Job Runtime:       87.3 min
# Time to First Job:     2.4 days

Track these metrics monthly. If adoption is stalling, ask why. Maybe your CLI is hard to use. Maybe the docs are lacking. Iterate based on feedback.

Complete Example: End-to-End ML Workflow

Let's tie it all together with a realistic example.

Step 1: Bootstrap a New Project

bash
# Data scientist creates new project via Backstage
# The scaffolder template generates:
#   - Git repository
#   - Project structure
#   - Docker build file
#   - Catalog entry
 
# Locally, they clone and start developing
git clone https://github.com/my-org/bert-sentiment.git
cd bert-sentiment
 
# Their IDE has iNet platform helpers
# (LSP plugin, VS Code extension, PyCharm plugin)

Step 2: Train Locally, Then Scale

python
# train.py - Works in notebook and on platform
import mlflow
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from inet_platform import ArtifactStore, ExperimentTracker
 
# Configure tracking
mlflow.start_run(run_name="v3-training")
 
# Training loop (simplified)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
 
for epoch in range(3):
    for batch in train_loader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
 
    # Log metrics
    val_loss = evaluate(model, val_loader)
    mlflow.log_metrics({"val_loss": val_loss, "epoch": epoch})
    print(f"Epoch {epoch}: val_loss={val_loss:.3f}")
 
# Save model
model.save_pretrained("checkpoints/final")
mlflow.pytorch.log_model(model, "model")
 
# Save to artifact store
store = ArtifactStore()
store.save_artifact(
    "bert-sentiment/v3/model",
    "checkpoints/final",
    metadata={"accuracy": 0.942, "f1": 0.943}
)

Step 3: Submit to Platform for Large-Scale Training

bash
# When ready to train on 8 GPUs with full dataset
inet platform train \
  --name "bert-sentiment-v3-full" \
  --script "train.py" \
  --image "ml-training:latest" \
  --gpu 8 \
  --memory "64Gi" \
  --distributed-backend "pytorch" \
  --output-artifacts "checkpoints/final" \
  --timeout-hours 48
 
# Output:
# Job submitted: bert-sentiment-v3-full-xyz789
# GPU allocation: 8/8 (100% of team quota)
# Expected duration: 24-36 hours
#
# Monitor with:
#   inet platform status bert-sentiment-v3-full-xyz789
#   inet platform logs bert-sentiment-v3-full-xyz789

Platform automatically:

  • Generates Argo Workflow with 8 GPU replicas
  • Mounts the artifact store
  • Sets up MLflow tracking
  • Configures distributed training (NCCL, Gloo)
  • Implements checkpointing and restart logic
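
The restart half of that last bullet can be sketched in plain Python. This is a hypothetical helper, not the platform's actual implementation; the checkpoint directory name and file format are made up for illustration. The idea is simply that a preempted or crashed job resumes from its latest checkpoint instead of step zero:

```python
import os
import pickle

CKPT_DIR = "ckpts"  # hypothetical directory the platform mounts into the job

def save_checkpoint(step, state):
    """Persist training state so a preempted job can resume instead of restarting."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    path = os.path.join(CKPT_DIR, f"step-{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    return path

def latest_checkpoint():
    """Return the most recent checkpoint's contents, or None on a fresh start."""
    if not os.path.isdir(CKPT_DIR):
        return None
    names = sorted(n for n in os.listdir(CKPT_DIR) if n.startswith("step-"))
    if not names:
        return None
    with open(os.path.join(CKPT_DIR, names[-1]), "rb") as f:
        return pickle.load(f)

# On (re)start, resume from the latest checkpoint rather than step 0
ckpt = latest_checkpoint()
start_step = ckpt["step"] + 1 if ckpt else 0

for step in range(start_step, start_step + 100):
    # ... one training step would run here ...
    if step % 50 == 0:  # checkpoint periodically
        save_checkpoint(step, state={"loss": 0.0})
```

The zero-padded filenames make a lexical sort equal a numeric sort, so "latest" is just the last entry.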

Step 4: Review Results in Web UI

The team opens the web dashboard:

┌─────────────────────────────────────────┐
│ BERT Sentiment Experiments              │
├─────────────────────────────────────────┤
│                                         │
│ bert-sentiment-v3-full-xyz789           │
│ Status: Complete ✓                      │
│ Runtime: 28h 45m                        │
│ GPU Hours: 230                          │
│ Val Loss: 1.38                          │
│                                         │
│ [View Metrics] [Compare Versions]       │
│ [Deploy to Staging] [View Logs]         │
│                                         │
└─────────────────────────────────────────┘

Step 5: Deploy to Production

python
from inet_platform import ModelRegistry, InferenceEndpoint
 
# Register model
registry = ModelRegistry()
registry.register_model(
    name="bert-sentiment",
    version="v3",
    artifact_path="s3://ml-artifacts/bert-sentiment/v3/model",
    framework="pytorch",
    description="Sentiment classification with 94.2% accuracy"
)
 
# Deploy inference endpoint
endpoint = InferenceEndpoint.create(
    model_name="bert-sentiment",
    model_version="v3",
    replicas=3,  # 3 inference pods for HA
    gpu_per_replica=0,  # CPU inference is fine for this model
    max_batch_size=32
)
 
print(f"Model live at: {endpoint.url}")
# Output: https://bert-sentiment-api.internal/predict

Now your production team can call:

bash
curl -X POST https://bert-sentiment-api.internal/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This product is amazing!"}'
 
# Response:
# {
#   "label": "positive",
#   "confidence": 0.98,
#   "inference_time_ms": 45
# }
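
For programmatic callers, the same request can be made from Python with nothing but the standard library. The endpoint URL and payload shape come from the example above; `build_request` and `predict` are illustrative names, not part of the platform SDK:

```python
import json
import urllib.request

ENDPOINT = "https://bert-sentiment-api.internal/predict"  # from the deploy step above

def build_request(text, url=ENDPOINT):
    """Build the POST request the endpoint expects: a JSON body with a 'text' field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def predict(text, url=ENDPOINT):
    """Call the sentiment endpoint and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(text, url)) as resp:
        return json.load(resp)

# predict("This product is amazing!") would return a dict like
# {"label": ..., "confidence": ..., "inference_time_ms": ...}
```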

The Impact

What does a platform like this give you?

For Data Scientists: They go from wrestling with infrastructure to training models in one command. Iteration time drops from days to hours.

For ML Ops Engineers: They manage resources centrally, track spending, and enforce quotas without manually intervening for every job.

For the Organization: Models reach production faster. You avoid snowflake infrastructure. Costs are transparent and controlled.

For Reliability: Multi-tenancy prevents one team from destabilizing others. Quotas prevent runaway spending. Logging and metrics enable debugging.

The platform doesn't replace domain expertise. It amplifies it. Your team spends time on models, not plumbing.

Case Study: From Chaos to Velocity in Nine Months

A data science organization we worked with had grown to forty people without any internal platform. Each scientist provisioned their own cloud resources, managed their own data, and deployed models manually. The results were predictable: costs were out of control, models were deployed inconsistently, and data was scattered across five different AWS accounts with no governance. A scientist wanting to use another scientist's preprocessed data had to ask for credentials and manually download files. Training a model meant figuring out resource provisioning from scratch. Deploying to production involved a manual handoff to the ops team and twenty minutes of waiting.

They decided to build an internal platform. Their first iteration took four months. It wasn't fancy. They took MLflow, added a thin Python CLI wrapper, and deployed it on Kubernetes. Training jobs went from manual provisioning to "inet platform train --gpu 4 --output model.pt". Simple. Scientists loved it immediately. Time to first training job dropped from thirty minutes to two minutes. That alone was a productivity win.

By month six, they'd added model registry integration. By month nine, they had experiment comparison, cost tracking, and automatic deployment to staging. The impact compounded. A team that could run five training experiments per week could now run twenty. Models iterated faster. Product teams got better recommendations. Revenue impact was measurable.

Breaking Down the Real-World Benefits

Let's ground this in concrete numbers. A company we worked with had a data science team of 25 people spending an average of 12 hours per week each wrestling with infrastructure, provisioning, and debugging deployment issues. That's 300 person-hours per week, or roughly 15,600 hours per year. At an average burdened cost of $150/hour, that's $2.34 million annually in lost productivity.

After deploying their internal platform, infrastructure overhead dropped to 3 hours per week per person. The 9-hour savings translates to about $1.75 million in recovered productivity. The platform engineering team (5 people) costs roughly $750,000 in annual salary. The net benefit in year one was $1 million in recovered productivity, with ongoing returns in subsequent years.

But the financial benefit isn't the full story. The team's velocity increased dramatically. Before the platform, a new data scientist needed 2-3 weeks to set up their development environment, understand the infrastructure, and run their first production training job. After the platform, they could do it in a day. That compressed onboarding time has real value—new team members are productive faster, and your hiring becomes less of a constraint on growth.

The Platform's Role in Model Reliability

Beyond velocity, there's a reliability angle that often gets overlooked. When your infrastructure is scattered—some models on one person's laptop, some on AWS, some on on-prem hardware—you have no visibility into what's running, what's failing, or what it costs. A data scientist's model runs away and racks up $50,000 in compute cost. Nobody notices until the bill arrives.

With a centralized platform, you have complete visibility. Every job logged, every cost tracked, every failure surfaced immediately. This isn't just about cost control—it's about preventing catastrophic failures. You can set hard limits on resource usage. You can alert when a training job exceeds its expected budget. You can automatically terminate runaway processes.
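
A minimal sketch of that enforcement logic, with made-up names and a hypothetical $/GPU-hour rate; a real platform would read actual spend from its metering store rather than compute it inline:

```python
from dataclasses import dataclass

GPU_HOUR_RATE = 2.50  # hypothetical $/GPU-hour; a real platform reads this from billing

@dataclass
class JobCost:
    job_id: str
    gpu_hours_used: float
    budget_usd: float

    @property
    def spend(self):
        return self.gpu_hours_used * GPU_HOUR_RATE

def check_job(job, warn_frac=0.8):
    """Return the action to take for this job's spend: ok, alert, or terminate."""
    if job.spend >= job.budget_usd:
        return "terminate"  # hard limit: kill the runaway job
    if job.spend >= warn_frac * job.budget_usd:
        return "alert"      # soft limit: page the job's owner
    return "ok"

# The 230-GPU-hour run from the example, against a $1,000 job budget
print(check_job(JobCost("bert-sentiment-v3-full", 230, 1000)))  # ok
```

Run on a schedule against every active job, a check like this is what turns "nobody notices until the bill arrives" into an automated page.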

This visibility also helps with compliance and audit. Regulated industries need to know exactly what models are running, when they were trained, on what data, by whom. A scattered infrastructure makes this nearly impossible. A platform makes it automatic. Every training run is logged with metadata, every model is registered with lineage, every inference is traceable.

The Hidden Cost of Not Having a Platform

Finally, consider the inverse. What does it cost you to not have a platform?

First, there's the technical debt. Engineers spend time building custom training scripts, custom deployment tools, custom monitoring. These tools work, but they're unmaintained, undocumented, and fragile. When someone leaves, that knowledge walks out the door.

Second, there's the bottleneck. Your infrastructure team becomes a gating factor for progress. Scientists can't spin up resources without human approval. Deployments require manual steps and handoffs. A simple model update becomes a multi-day process.

Third, there's the risk. Without governance, you end up with models in production that nobody understands. Data drift goes undetected. Models degrade silently. You deploy a model that turns out to have an unintended bias because nobody ran adequate testing.

A platform fixes all of this by making good practices the path of least resistance. Testing becomes automatic. Data validation becomes automatic. Model monitoring becomes automatic. Not because you're forcing it, but because the platform makes it the easiest path forward.

Why Platforms Matter Now More Than Ever

The ML infrastructure landscape is getting more complex every year. Three years ago, you could train a model on your laptop, upload it to S3, and call an API to serve it. Today, you need to orchestrate distributed training across GPUs, manage complex dependencies, track experiments, version data, monitor drift, integrate with feature stores, and handle model governance.

Without a platform abstracting this complexity, you're asking each data scientist to become a DevOps engineer. That's a waste of their talents. A good platform frees them to focus on modeling, not plumbing.

Integration with Data Quality and Compliance

As your platform grows, data quality and compliance become critical. Your training data needs to be auditable. Who provided this training data? When? What preprocessing was applied? Did we get the necessary consents? In regulated industries, this isn't academic. It's required by law. A model trained on data you didn't have permission to use can create legal liability.

Your platform needs to track lineage. Every training run should know which raw data it started from, what processing it went through, and what the final dataset consisted of. Every model deployment should know which training run produced it, which data it was trained on, and what its performance characteristics are. This metadata isn't just useful for debugging. It's required for compliance audits.
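
A lineage entry along those lines might look like this sketch. The field names, the S3 path, and the use of a SHA-256 dataset fingerprint are illustrative choices, not a fixed schema:

```python
import hashlib
import json

def lineage_record(run_id, raw_data_uri, preprocessing_steps, dataset_rows):
    """Build an auditable lineage entry for one training run.

    The SHA-256 fingerprint lets an auditor verify exactly which dataset
    a model was trained on, long after the run finished.
    """
    fingerprint = hashlib.sha256(
        json.dumps(dataset_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "run_id": run_id,
        "raw_data": raw_data_uri,
        "preprocessing": preprocessing_steps,
        "dataset_sha256": fingerprint,
    }

rec = lineage_record(
    run_id="bert-sentiment-v3",
    raw_data_uri="s3://raw-data/reviews.jsonl",  # hypothetical source
    preprocessing_steps=["dedupe", "lowercase", "tokenize"],
    dataset_rows=[{"text": "great product", "label": 1}],
)
```

Because the fingerprint is deterministic, re-running the same pipeline on the same data produces the same hash, which is precisely what a compliance audit wants to see.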

Data quality gates are equally important. Before a dataset reaches a model training job, it should pass quality checks. Are there suspicious null values? Are feature distributions reasonable? Are there duplicates? Did we detect data drift that would cause retraining? These checks should be automated and run on every dataset. When a check fails, the job should be blocked with a clear error message explaining what went wrong.
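
Here's a toy version of such a gate, checking nulls and duplicates on a list-of-dicts dataset. The 5% null threshold is an illustrative default; a production gate would also check feature distributions and drift as described above:

```python
def quality_gate(rows, max_null_frac=0.05):
    """Run basic pre-training checks on a list-of-dicts dataset.

    Returns a list of human-readable failures; an empty list means the
    dataset may proceed to training.
    """
    if not rows:
        return ["dataset is empty"]
    failures = []
    # Null check: no column may exceed the allowed null fraction
    for col in rows[0].keys():
        null_frac = sum(1 for r in rows if r.get(col) is None) / len(rows)
        if null_frac > max_null_frac:
            failures.append(f"column '{col}' is {null_frac:.0%} null")
    # Duplicate check: identical rows usually indicate a broken join upstream
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        dupes += key in seen
        seen.add(key)
    if dupes:
        failures.append(f"{dupes} duplicate row(s)")
    return failures

rows = [
    {"text": "good", "label": 1},
    {"text": "bad", "label": 0},
    {"text": "good", "label": 1},  # duplicate
    {"text": None, "label": 1},    # null text
]
for failure in quality_gate(rows):
    print(failure)
```

When the returned list is non-empty, the platform blocks the training job and surfaces each failure message, which is exactly the "clear error message" behavior the gate exists to provide.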

Common Platform Anti-Patterns to Avoid

As you build your platform, watch out for these common pitfalls:

Building for Scientists Instead of With Them: You sit in engineering and design a platform you think scientists want. Six months later, nobody uses it because it doesn't match their workflows. Better approach: involve scientists from day one. Show prototypes, gather feedback, iterate based on what they tell you.

Making the Platform Too Opinionated: You enforce a specific workflow, a specific framework, a specific deployment pattern. Scientists chafe against the constraints. Better approach: provide sensible defaults, but allow escape hatches. Let power users opt out of abstractions when needed.

Ignoring Operational Burden: You build a platform that works great when everything goes right. But what happens when a training job crashes? When a model breaks? When someone accidentally depletes the GPU budget? If your platform doesn't handle failure gracefully, operators will hate it.

Forgotten Documentation: Your platform is powerful but poorly documented. Scientists waste hours figuring out how to do basic tasks. They stop using it and go back to their old workflows. Documentation isn't a nice-to-have—it's essential.

Long-Term Sustainability and Cost Management

Platform infrastructure costs money. Kubernetes clusters, PostgreSQL databases, artifact storage, compute for training jobs—it all has a price tag. Over time, these costs can become significant. A platform that costs five hundred thousand dollars per year is only worthwhile if it generates more value than that. Most platforms easily clear this threshold when properly utilized. But you need to understand your costs and track them obsessively.

Implement detailed cost tracking in your platform. Tag every job with its owner. Track how much compute it used. Track how much storage it consumed. When your team runs a thousand GPU-hours of training, they should know that cost and be accountable for it. This drives efficient behavior. Scientists running thousands of experiments with bad hyperparameters suddenly become more thoughtful when they see the compute cost.

Implement cost controls. Set per-team budgets. If a team exceeds their budget, they get warnings. If they grossly exceed it, training jobs start getting rejected. Cost controls prevent runaway spending from a single mistake. One scientist could accidentally submit a poorly configured training job that burns money for a month if nobody's watching.
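
A sketch of that admission logic, with hypothetical team budgets and thresholds (warn once projected spend passes 80% of budget, reject past 120%); the numbers and names are illustrative:

```python
TEAM_BUDGETS = {"nlp": 20_000, "vision": 35_000}  # hypothetical $/month per team

def admit_job(team, spend_so_far, estimated_job_cost, budgets=TEAM_BUDGETS):
    """Decide whether to accept a new training job given month-to-date spend.

    Warn once projected spend passes 80% of budget; reject past 120%.
    """
    budget = budgets[team]
    projected = spend_so_far + estimated_job_cost
    if projected > 1.2 * budget:
        return "reject"
    if projected > 0.8 * budget:
        return "warn"
    return "accept"

print(admit_job("nlp", spend_so_far=12_000, estimated_job_cost=3_000))  # accept
```

Checking the projected total rather than current spend is the detail that stops a single poorly configured job from blowing through a month's budget before anyone notices.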

Building Organizational Competency

One often-overlooked aspect of platform success is building organizational knowledge. You can have the best infrastructure in the world, but if people don't know how to use it effectively, it's worthless. Invest in documentation. Build tutorials. Run workshops. Create runbooks for common scenarios. Share patterns. When someone figures out a clever way to use the platform, document it and share it with the team.

Create a platform community within your organization. Have regular office hours where scientists can ask questions. Host platform showcase sessions where teams share interesting uses. Build a Slack channel for platform questions and celebrate good answers. Make it fun and rewarding to learn the platform. Create heroes and champions. When someone becomes expert in a platform feature, recognize that expertise.

Rotate on-call responsibilities for platform support. Everyone on the core platform team should have a week on-call per quarter. This ensures that platform knowledge spreads and that nobody becomes a single point of failure. It also builds empathy. When you're on call and a scientist's training job is blocked because of a platform bug, you feel the pain. You're motivated to fix it.

Next Steps

Start small. Pick one pain point (maybe experiment tracking is chaotic). Pick one tool (MLflow, Kubeflow, whatever your team knows). Build a thin wrapper. Gather feedback. Iterate.

The platform grows with your needs. By month six, you'll have something that feels like magic to your team. By month twelve, you'll wonder how you ever worked without it.

Measuring Success

How do you know if your platform is working? Track these metrics:

  • Time to first training job: How long does it take a new scientist to submit their first job? Ideally under 1 hour.
  • Adoption rate: What percentage of your scientists are actively using the platform?
  • Experiment velocity: How many experiments per scientist per week? Platforms should increase this.
  • Cost per model: Are your infrastructure costs going down as you optimize the platform?
  • Time to production: How long from model training to production deployment?
  • Mean time to recovery: When something breaks, how quickly can you fix it?

If these metrics are improving, your platform is working. If they're stagnant or declining, dig deeper and understand why.
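
Two of these metrics fall straight out of job submission logs. A sketch, with illustrative function names, assuming you can query onboarding dates and per-user job timestamps:

```python
from datetime import datetime

def time_to_first_job(onboarded_at, job_timestamps):
    """Hours from onboarding to a scientist's first job, or None if no jobs yet."""
    if not job_timestamps:
        return None
    return (min(job_timestamps) - onboarded_at).total_seconds() / 3600

def adoption_rate(all_scientists, job_owners):
    """Fraction of scientists who submitted at least one job this period."""
    return len(set(job_owners) & set(all_scientists)) / len(all_scientists)

joined = datetime(2025, 3, 1, 9, 0)
hours = time_to_first_job(joined, [datetime(2025, 3, 1, 9, 45)])
print(f"time to first job: {hours:.2f}h")  # 0.75h, under the one-hour target
print(f"adoption: {adoption_rate(['ana', 'bo', 'cy', 'di'], ['ana', 'cy']):.0%}")
```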


The Compounding Value of Platform Investment

One thing becomes clear when you step back from the implementation details: a well-built ML platform is one of the highest-ROI infrastructure investments you can make. Not because it's technically sophisticated, but because it amplifies your team's effectiveness across every dimension. Faster time to experiment means better models. Better models mean better products. Better products mean competitive advantage and revenue.

The platform becomes especially valuable as your organization scales. With ten data scientists and ad-hoc infrastructure, the mess is manageable. With fifty scientists and ad-hoc infrastructure, you've got chaos. Your infrastructure team spends all their time fighting fires instead of building. Your scientists spend all their time wrestling systems instead of building models. Everyone is frustrated. The platform breaks this cycle by introducing structure and automation at the right level of abstraction.

Technical Debt and Platform Evolution

Even well-designed platforms accumulate technical debt. You choose Django over FastAPI because Django was faster to build. Later you realize FastAPI would have been better. You choose Kafka without considering whether your team has the operations expertise. Now you're running Kafka and keeping two on-call rotations because managing Kafka is complex and error-prone. You chose a specific vector database because it had the best benchmark scores. Later you discover it doesn't scale to your query volume.

These decisions aren't failures. They're inevitable in fast-moving organizations. What matters is treating them as design decisions, not accidents. When you discover that a component isn't working well, build a migration path rather than throwing it all away. The cost of throwing away everything and starting over is usually higher than the cost of migrating to something better. If you choose Kafka and it becomes too complex, migrate to RabbitMQ or AWS Kinesis rather than abandoning the platform entirely.

Technical debt in platforms compounds because platforms are shared infrastructure. When the batch job framework has a bug, every team's batch jobs suffer. When the experiment tracking system loses data, every team loses data. Invest more heavily in reliability and testing for platform components than you would for application code. Your platform's bugs affect more people. Your platform's downtime blocks more work. The ROI on platform testing is higher.

Building Buy-In Across Your Organization

Success with platforms often comes down to organizational dynamics more than technology. You can build the most elegant platform ever, but if your team doesn't trust it or doesn't understand how to use it, adoption will fail. The antidote is involving your users in design from the start. Run interviews with data scientists about their pain points. Show prototypes and get feedback. Build the platform with your team, not for your team.

The Long Game: Platform as Competitive Moat

Companies that excel at AI don't necessarily have smarter data scientists than everyone else. They have better infrastructure. They can run more experiments faster. They can deploy models more reliably. They can operate systems more efficiently. Infrastructure that enables this becomes a competitive advantage that's hard to replicate. It's not patented. It's not proprietary algorithms. It's the accumulated expertise and tooling that makes AI happen faster in your organization than in others.

This is why great AI companies invest heavily in MLOps and platform infrastructure. It's not because they love infrastructure for its own sake. It's because they recognize that platform quality directly translates to product quality, model quality, and competitive positioning. When your team can experiment fifty times per week instead of five, and each experiment leads to measurable improvements, you win.


Building infrastructure that disappears into the background so your ML team can focus on what matters.

Need help implementing this?

We build automation systems like this for clients every day.

Discuss Your Project