April 24, 2025
AI/ML Infrastructure Fundamentals Architecture

ML System Design Document Template: From Requirements to Architecture

You've built a killer ML model in a notebook. Awesome. But now you need to ship it to production, and that's where things get real. A solid ML system design document is your roadmap - it bridges the gap between "this works in my experiment" and "this scales reliably in production."

In this article, we'll walk through the anatomy of an ML system design doc, why each section matters, and provide a fillable template you can adapt for your own projects. We'll also work through a real example using a recommendation system to show you how these pieces fit together.

Table of Contents
  • Why You Need an ML System Design Document
  • The Real Cost of Skipping the Design Phase
  • Building a Shared Language Across Disciplines
  • The Compounding Value of Documentation
  • The Anatomy of a Complete ML System Design
    • 1. Problem Statement
    • 2. Success Metrics (ML + Business KPIs)
    • 3. Data Requirements
    • 4. Model Architecture
    • 5. Infrastructure Architecture
    • 6. Trade-Off Analysis Matrix
    • 7. Risk Assessment and Mitigation
    • 8. Rollout Plan
  • Worked Example: E-Commerce Recommendation System
    • Problem Statement
    • Success Metrics
    • Data Requirements
    • Model Architecture
    • Infrastructure Architecture
    • Trade-Off Analysis
    • Risk Assessment
    • Rollout Plan
  • Fillable ML System Design Template
    • 1. Problem Statement
    • 2. Success Metrics (ML Metrics, Business Metrics)
    • 3. Data Requirements
    • 4. Model Architecture
    • 5. Infrastructure Architecture
    • 6. Trade-Off Analysis
    • 7. Risk Assessment
    • 8. Rollout Plan
  • Request Flow Diagram (Real-Time Inference)
  • Monitoring and Incident Response
    • The Incident Response Mindset
  • ML-Specific Requirements Deep Dive
    • Framing the Prediction Task
    • Precision vs. Recall Trade-Offs in Context
    • Training Data Freshness and Concept Drift
    • Acceptable Degradation Gradient
    • Advanced Topic: Multi-Objective Optimization
    • Measuring Model Fairness
  • Key Takeaways
  • What Comes Next

Why You Need an ML System Design Document

Here's the thing: ML systems are different from traditional software systems. You can't just code review your way to confidence. Your model might be mathematically sound but trained on biased data. Your infrastructure might scale, but your model might degrade unpredictably over time.

An ML system design document forces you to think through:

  • What problem are we actually solving? (Not just "build a classifier," but "reduce false positives by 30% without sacrificing recall")
  • How do we measure success? (Both ML metrics AND business KPIs)
  • What can go wrong? (And what's our playbook for fixing it?)
  • How do we evolve the system? (Monitoring, retraining, incident response)

Without these answers locked in before you write production code, you'll be firefighting every day.

The design document serves as your team's contract - it's what you reference when someone asks "why did we choose Kubernetes instead of Lambda?" or "how do we handle model degradation?" A well-written design doc prevents you from making the same architectural mistakes twice.

The Real Cost of Skipping the Design Phase

Many teams believe that writing a design document is overhead. They'd rather "move fast" and iterate in production. This is usually a costly mistake. The teams that ship fastest are the ones that spent time thinking through the hard problems upfront. Not because thinking is faster than building - it's not - but because the cost of changing decisions in production is exponentially higher than the cost of changing them in a design document.

Consider a team that decides mid-production to switch from REST to gRPC for their inference API. They've already built client libraries in three languages. They've trained all their users on the REST format. Changing now means rewriting service mesh configurations, updating all downstream services, retraining the ops team on new debugging tools, and potentially breaking production for hours during the migration. All of this pain could have been avoided by a thirty-minute conversation in the design phase.

Or consider a team that built their feature store without thinking about feature freshness requirements. Three months into production, they discover that their daily batch updates are causing stale features, which is degrading model performance. Now they're scrambling to build real-time feature computation, a complex undertaking that derails other priorities. A good design doc would have forced them to think through the "how fresh do features need to be?" question before they wrote the first line of code.

The design document is your insurance policy against these expensive mistakes. It's the time you spend thinking so you don't spend months fixing.

Building a Shared Language Across Disciplines

Another underrated benefit of a good design doc is that it creates a shared understanding across your entire team. Your data scientists speak in the language of metrics and models. Your infrastructure engineers speak in the language of scalability and reliability. Your product managers speak in the language of business impact. Without a design document, these groups are often talking past each other.

The design document forces a translation. When your data scientist proposes using a deep neural network and your infrastructure engineer says "that has too much latency," the design document is where you resolve this tension. You look at the latency requirements, the accuracy requirements, and you make an informed trade-off together. The neural network might win because 1% accuracy improvement is worth 200ms more latency. Or XGBoost might win because your latency SLA is non-negotiable. But you decide together, on paper, with clear reasoning.

This shared understanding is worth its weight in gold when you hit production issues. When the model starts degrading, your team doesn't argue about whether to retrain or investigate drift. The design doc already spelled out the degradation thresholds and the playbook. You just follow the plan.

The Compounding Value of Documentation

Here's something most engineers underestimate: the value of good documentation compounds over time. When you onboard a new engineer six months from now, they don't need to ask "why did we choose this architecture?" They can read the design doc and understand the decision-making process. When you're evaluating whether to upgrade a dependency, you reference the design doc to see what constraints matter. When you're planning the next iteration of the system, you read the design doc to understand what problems the current version was trying to solve and how well it's working.

The cost of writing a design doc is paid in the first week. The benefits accrue over months and years. Most teams underestimate how much time good documentation saves them in the long run.

The Anatomy of a Complete ML System Design

Let's break down the essential sections and why each one matters.

1. Problem Statement

Start here. What are we building and why?

Your problem statement should answer:

  • What's the current pain point? Be specific. "Users don't find relevant items" is vague. "Our search engagement rate is 12%, down from 18% last year, because recommended items don't match user intent" is actionable.
  • What business outcome are we targeting? Increased engagement? Reduced churn? Faster query response times?
  • Who benefits? Users, the business, a specific team?
  • Why now? What's changed that makes this problem worth solving?

Example: "Our e-commerce platform recommends products to 2M weekly active users. Current recommendations rely on collaborative filtering with a 7-day model refresh cycle. Engagement on recommendations has dropped to 8% as our catalog grew to 500K items. We want to increase engagement to 12% and reduce customer churn by 2% within Q2 by implementing a real-time hybrid recommendation system that combines content-based and collaborative signals."

See? Concrete. Measurable. Tied to business value.

A strong problem statement also articulates the urgency and scope. Is this a quick win that could be done in a sprint, or is this a three-month architectural overhaul? Setting these expectations prevents scope creep and ensures alignment with stakeholders. Many teams skip this step or write something so vague it's useless. Don't. Spend an hour getting the problem statement right. It should be the thing you come back to when you're debating design trade-offs six weeks in.

2. Success Metrics (ML + Business KPIs)

You need to track two categories of metrics: what the model does, and what the business cares about.

ML Metrics tell you how well your model performs:

  • Precision / Recall / F1-score (classification)
  • Mean Absolute Error / RMSE (regression)
  • Recommendation diversity, serendipity, novelty (ranking)
  • Latency (inference time)
  • Model staleness (how old is the training data?)

Business Metrics tell you if you're actually creating value:

  • Engagement rate (clicks, time spent)
  • Conversion rate
  • Customer lifetime value
  • Churn reduction
  • Cost savings (e.g., reduced infrastructure spend)

Here's what trips up most teams: they optimize for ML metrics in isolation. "We achieved 95% accuracy!" Meanwhile, your business metric stayed flat because you're not addressing the right problem.

Define your thresholds upfront. What precision/recall trade-off is acceptable? Is 91% precision with 75% recall better than 85% precision with 88% recall? That depends on your business context. A spam filter wants high precision (few false positives). A medical screening tool wants high recall (catch every case). Lock this in your design doc.

The subtle art here is avoiding "vanity metrics" - numbers that look good in dashboards but don't actually measure what matters. Click-through rate on recommendations might go up, but if those clicks don't convert to purchases, you've optimized the wrong thing. Define metrics that connect directly to revenue or user satisfaction. And be honest about what you can actually measure. If you can't instrument it in the first week of deployment, reconsider whether it's a metric worth tracking.
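
To make threshold decisions concrete before launch, sweep candidate thresholds on validation data and look at the precision/recall pairs you'd actually get. A minimal sketch, using toy labels and scores (all numbers illustrative):

```python
def precision_recall(y_true, y_score, threshold):
    """Compute precision and recall at a given decision threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Sweep thresholds to see the trade-off before committing to one in the doc
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
for th in (0.5, 0.65, 0.85):
    p, r = precision_recall(y_true, y_score, th)
    print(f"threshold={th}: precision={p:.2f} recall={r:.2f}")
```

Putting a table like this in the design doc makes the stakeholder conversation about acceptable trade-offs concrete instead of abstract.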

3. Data Requirements

ML systems are only as good as their data. Spell out:

  • Data sources: Where does training data come from? What's the pipeline?
  • Data freshness: How old can training data be? Real-time stream or batch daily?
  • Volume: How many training examples do you need? What's your sample size for each class?
  • Features: What raw features exist? Do you need to engineer new ones?
  • Labeling: How do you get ground truth labels? Manual annotation? Weak supervision?
  • Train/validation/test split: How do you avoid data leakage?
  • Class imbalance: Do you have a balanced dataset or do you need resampling?

Be honest about data limitations. "Our dataset has 2M examples, but only 50K labeled positives" changes your modeling strategy entirely.

When you document data requirements, also think about operational realities. What happens if your labeling pipeline breaks down? Do you have enough historical data to retrain if you discover a labeling bug? What's your data retention policy? If you need to comply with regulations like GDPR, how does that affect your training data pipeline?

Data freshness deserves special attention. Real-time data streams are seductive ("we can react instantly!"), but they come with operational complexity: more infrastructure, harder debugging, more failure modes. Daily batch feels old-fashioned until you realize it's reliable, easy to test, and still fast enough for most use cases. Document your choice and your reasoning. Don't just assume streaming is better because it sounds advanced.

4. Model Architecture

Describe your modeling approach. Don't just say "deep learning" - be specific.

  • Architecture choice: Why this model over alternatives? (Neural network vs. XGBoost vs. linear regression?)
  • Input representation: How do you encode features? One-hot encoding? Embeddings?
  • Hyperparameters: Learning rate, batch size, regularization, etc.
  • Training approach: Supervised, semi-supervised, transfer learning, fine-tuning?
  • Ensemble strategy: Single model or ensemble?

And crucially: document your trade-offs.

The Architecture Decision Framework

Choosing a model architecture is one of the most important decisions you'll make, and it's often rushed. Teams get seduced by cutting-edge approaches or follow the "we should be using deep learning" impulse without thinking deeply about whether it's actually the right choice. The reality is that simpler models are often better in production.

When you're evaluating model architectures, you need to think about multiple dimensions simultaneously. Performance is important, yes, but so is interpretability. A 95% accuracy model that you can't explain is less useful than a 92% accuracy model where you understand exactly why it made each decision. Similarly, a model that achieves great offline metrics but has unacceptable latency in production is worse than a slower-training model that meets your inference time budget.

The key is to articulate your constraints upfront. If your latency requirement is under 50 milliseconds, that immediately rules out many deep learning architectures. If your model needs to be deployable on edge devices with limited memory, you need a fundamentally different approach than if you're deploying on beefy cloud servers. If you need to understand your model's decisions for regulatory compliance, you need to favor interpretable approaches even if they sacrifice a few percentage points of accuracy.

One technique we've found invaluable is building a decision matrix early in the design process. Rather than debating architecture choices verbally, you write down the requirements, list candidate architectures, and score each one against each requirement. This makes the decision visible and lets you see clearly where trade-offs exist.

"We chose a light gradient boosting model (LightGBM) over a deep neural network because:

  • Inference latency requirement: <100ms, LightGBM achieves 35ms, neural net was 450ms
  • Model interpretability: our business stakeholders need to understand why recommendations are made
  • Training speed: LightGBM retrains in 2 hours, deep net would need 6 hours
  • Accuracy: LightGBM achieved 94% AUC, neural net achieved 95% AUC (1% difference not worth the latency hit)"

That's how you document trade-offs. It's not defensiveness; it's clarity. You're showing your work, and that's what good engineers do.
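
The decision-matrix technique described above can be sketched as a small script: weight each requirement, score each candidate, and rank. The weights and 1-5 scores below are hypothetical placeholders, not the article's actual numbers:

```python
# Hypothetical requirement weights -- adapt to your own constraints
requirements = {"latency": 0.4, "accuracy": 0.3, "interpretability": 0.2, "training_speed": 0.1}

# Illustrative 1-5 scores for each candidate against each requirement
candidates = {
    "LightGBM":   {"latency": 5, "accuracy": 4, "interpretability": 4, "training_speed": 5},
    "Neural net": {"latency": 2, "accuracy": 5, "interpretability": 2, "training_speed": 2},
    "Linear":     {"latency": 5, "accuracy": 2, "interpretability": 5, "training_speed": 5},
}

def weighted_score(scores, weights):
    """Sum of per-requirement scores weighted by how much each requirement matters."""
    return sum(scores[req] * w for req, w in weights.items())

ranked = sorted(candidates, key=lambda c: weighted_score(candidates[c], requirements), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name], requirements):.2f}")
```

The point isn't the arithmetic; it's that writing the weights down forces the team to agree on what matters before arguing about architectures.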

5. Infrastructure Architecture

Now let's talk deployment. You need to describe:

  • Training pipeline: How do you train the model? Batch job? Scheduled service?
  • Feature store: Where do you compute and serve features? Real-time or pre-computed?
  • Model serving: REST API? gRPC? Batch predictions?
  • Monitoring: What metrics do you track in production?
  • Retraining strategy: Weekly? When performance drops? On-demand?

Here's a typical architecture:

Raw Data → Feature Engineering → Feature Store → Model Training → Model Registry
                                      ↓
                              Feature Serving
                                      ↓
                              Online Inference
                                      ↓
                         User Gets Recommendation

6. Trade-Off Analysis Matrix

This is where you show your work. Don't hide behind one choice - show the alternatives you considered and why you rejected them.

Create a table like this:

| Aspect | Option A | Option B | Option C | Chosen | Rationale |
|---|---|---|---|---|---|
| Model Type | XGBoost | Neural Net | Linear | XGBoost | 94% AUC, 35ms latency, interpretable |
| Feature Store | Feast | Tecton | Custom | Feast | Open-source, AWS integration, team expertise |
| Serving | REST API | gRPC | Batch | REST API | 95% of requests are single-prediction, simpler ops |
| Training Schedule | Daily | Weekly | On-demand | Daily | Catalog updates frequently, staleness risk grows |
| Cost | $8K/month | $15K/month | $3K/month | $8K/month | $3K has latency risk, $15K over budget |

Showing your reasoning proves you didn't just pick the trendy option.

7. Risk Assessment and Mitigation

What can go wrong? Be paranoid. Here are the big ones:

Data Quality Risks:

  • Training data becomes stale → Implement automated drift detection
  • Distribution shift (your data today differs from yesterday) → Monitor prediction distribution daily
  • Label noise (ground truth labels are wrong) → Use consensus labeling, spot-check accuracy

Model Risks:

  • Model bias (unfair to certain user groups) → Test model fairness monthly, disaggregate metrics by demographic
  • Adversarial examples (users game the system) → Monitor for adversarial patterns, add constraints
  • Model collapse (feedback loop degrades quality) → Separate exploration/exploitation, A/B test new recommendations

Infrastructure Risks:

  • Feature pipeline fails → Fallback to cached features, alert on pipeline latency
  • Model serving becomes a bottleneck → Caching strategy, load testing
  • Retraining job fails → Maintain last N model versions, automatic rollback

Operational Risks:

  • Incident response plan is missing → Document playbook, run incident simulation quarterly
  • Monitoring blindness (you don't see the problem until users complain) → Set up alerting for model performance, latency, data drift

8. Rollout Plan

How do you go from "design" to "in production"? Break it into phases:

Phase 0: Offline Validation

  • Evaluate model on holdout test set
  • Run fairness audits
  • Performance benchmarking

Phase 1: Shadow Mode (Week 1-2)

  • System runs in parallel but doesn't affect users
  • Compare predictions to current system
  • Catch bugs without impact

Phase 2: Canary Rollout (Week 3-4)

  • Show new recommendations to 5% of users
  • Monitor engagement, click-through rate, model latency
  • If metrics improve and latency is good, increase to 10%

Phase 3: Gradual Rollout (Week 5-8)

  • Slowly increase traffic: 25% → 50% → 100%
  • Maintain monitoring dashboards
  • Be ready to rollback at any time

Phase 4: Full Production (Ongoing)

  • 100% of traffic
  • Daily monitoring
  • Weekly retraining
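
Canary and gradual rollouts need a deterministic way to assign users to buckets, so each user sees a consistent experience as traffic ramps from 5% to 100%. A minimal sketch using hash-based bucketing (the salt string is an arbitrary placeholder):

```python
import hashlib

def in_canary(user_id: str, percent: float, salt: str = "rec-v2-canary") -> bool:
    """Deterministically assign a user to the canary bucket.

    Hash-based bucketing keeps a user's experience stable across requests,
    unlike random sampling per request. Bumping the salt reshuffles buckets
    for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999
    return bucket < percent * 100          # e.g. percent=5 -> buckets 0..499

# Roughly `percent`% of users land in the canary
share = sum(in_canary(str(u), 5) for u in range(100_000)) / 100_000
print(f"canary share: {share:.3f}")
```

To ramp from 5% to 25%, you only change `percent`; every user already in the canary stays in it, which keeps the comparison clean.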

Worked Example: E-Commerce Recommendation System

Let's make this concrete. Here's how you'd design a recommendation system for an e-commerce platform.

Problem Statement

"Our platform has 2M monthly active users and 500K products. Our current collaborative filtering system (trained weekly) recommends items to users on product pages and in email digests. Recommendation engagement (defined as clicks within 5 seconds of viewing) is 8%. We want to increase engagement to 12% and reduce recommendation staleness from 7 days to 1 day by implementing a hybrid recommendation system that combines real-time collaborative signals with content-based matching on new products."

Success Metrics

ML Metrics:

  • Ranking AUC (ability to rank relevant items higher): ≥0.92
  • NDCG@5 (normalized discounted cumulative gain at top 5): ≥0.65
  • Inference latency p99: <150ms
  • Model staleness: <24 hours

Business Metrics:

  • Recommendation click-through rate: increase from 8% to 12% (target: +50%)
  • Engagement time on recommended items: increase from 45s to 60s average
  • Add-to-cart rate from recommendations: increase from 2.1% to 2.8%
  • Customer repeat purchase rate: maintain or improve

Data Requirements

  • Data sources: User click events, product browsing history, purchase history, product metadata (category, price, description)
  • Freshness: Real-time user interactions, product features updated daily
  • Volume: 10M user events/day, 500K products, 2.5B historical user-product interactions
  • Features: User embedding (from browsing history), product embedding (from category, price, description), interaction recency
  • Labeling: Click = positive signal, no-click = negative signal (weak supervision)
  • Train/val/test split: Temporal split (6 months training, 1 month validation, 1 month test) to avoid data leakage
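
A temporal split like the one above can be sketched in a few lines. The 30-day month approximation and the `(timestamp, payload)` event format are assumptions for illustration:

```python
from datetime import date, timedelta

def temporal_split(events, train_months=6, val_months=1, test_months=1):
    """Split (timestamp, payload) events into chronological train/val/test sets.

    A temporal split avoids leakage: the model never trains on interactions
    that happened after the ones it is validated or tested on.
    """
    events = sorted(events, key=lambda e: e[0])
    end = events[-1][0]
    test_start = end - timedelta(days=30 * test_months)
    val_start = test_start - timedelta(days=30 * val_months)
    train_start = val_start - timedelta(days=30 * train_months)
    train = [e for e in events if train_start <= e[0] < val_start]
    val = [e for e in events if val_start <= e[0] < test_start]
    test = [e for e in events if e[0] >= test_start]
    return train, val, test

# Toy data: one event per day for ~8 months
events = [(date(2025, 1, 1) + timedelta(days=d), f"event-{d}") for d in range(240)]
train, val, test = temporal_split(events)
print(len(train), len(val), len(test))
```

Compare this with a random shuffle split, which would leak tomorrow's clicks into today's training set and inflate offline metrics.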

Model Architecture

Chosen: Two-tower neural network with embedding layers

User Tower:
  user_id → embedding lookup → dense(128) → dense(64)

Product Tower:
  product_id → embedding lookup
  product_category → embedding lookup
  product_price → normalize
  concat → dense(128) → dense(64)

Output: dot product of user embedding & product embedding → sigmoid → probability

Rationale:

  • Two-tower architecture enables efficient serving (precompute product embeddings)
  • Embedding layers handle sparse categorical features efficiently
  • Inference is fast (matrix multiplication, not sequence processing)
  • Can be trained end-to-end with click signals as labels
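
To illustrate why the two-tower design serves efficiently: once product embeddings are precomputed, scoring a user against the entire catalog is a single matrix-vector product. The random embeddings below stand in for trained towers, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the learned towers: in production these come from training;
# here they are random just to show the serving path.
n_users, n_products, dim = 100, 1000, 64
user_embeddings = rng.normal(size=(n_users, dim))
product_embeddings = rng.normal(size=(n_products, dim))  # precomputed offline

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recommend(user_id: int, top_k: int = 5):
    """Score every product for one user with a single matvec, then take top-k."""
    scores = sigmoid(product_embeddings @ user_embeddings[user_id])
    top = np.argsort(scores)[::-1][:top_k]
    return list(zip(top.tolist(), scores[top].round(3).tolist()))

print(recommend(user_id=42))
```

This is the property that makes the <150ms latency budget realistic: no per-request sequence processing, just a dot product per candidate (or an approximate nearest-neighbor lookup at larger catalog sizes).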

Infrastructure Architecture

Here's the full system:

```mermaid
graph LR
    A["User Events<br/>(clicks, views)"] -->|daily batch| B["Feature Store<br/>(Feast)"]
    C["Product Catalog<br/>(daily updates)"] -->|daily batch| B
    B -->|training dataset| D["Model Training<br/>(weekly)"]
    D -->|artifact| E["Model Registry<br/>(MLflow)"]
    E -->|pull latest| F["Model Server<br/>(FastAPI)"]
    B -->|real-time lookup| F
    F -->|recommendations| G["Frontend<br/>(React)"]
    H["Monitoring<br/>(Prometheus)"] -->|alerts| I["PagerDuty"]
    F -->|logs| H
```

Component Details:

  1. Feature Store (Feast): Stores user and product embeddings, updated daily from batch job
  2. Training Pipeline (Airflow): Weekly job that pulls training data, trains model, logs metrics
  3. Model Server (FastAPI): Serves recommendations as REST API (target latency: <150ms)
  4. Monitoring (Prometheus): Tracks inference latency, model predictions distribution, model drift

Trade-Off Analysis

| Decision | Option A | Option B | Option C | Chosen | Why |
|---|---|---|---|---|---|
| Model | Two-tower NN | XGBoost | Collaborative filtering only | Two-tower NN | Real-time serving, new product handling, handles cold-start better |
| Feature Freshness | Real-time | Daily batch | Weekly | Real-time user events, daily product features | Balances freshness with operational complexity |
| Serving | REST API | Batch pre-compute | gRPC | REST API | 95% single-prediction requests, easier integration |
| Retraining | Daily | Weekly | On-demand | Weekly | Weekly is sufficient, lower operational overhead |
| Infrastructure | Managed ML (SageMaker) | Self-hosted (Kubernetes) | Serverless (Lambda) | Self-hosted | Team expertise, cost control, custom monitoring |

Risk Assessment

| Risk | Impact | Mitigation |
|---|---|---|
| Cold-start (new users/products) | High | Content-based fallback, product metadata features |
| Data quality issues | High | Daily data quality checks, alert on null rates |
| Model bias (gender/age bias) | Medium | Monthly fairness audits disaggregated by user segment |
| Inference latency spike | Medium | Caching strategy, circuit breaker, fallback to cached recommendations |
| Feedback loop (popularity bias) | Medium | Exploration set (20% random recommendations), monitor coverage |
| Training job failure | Medium | Keep last 3 model versions, automatic rollback |
| Concept drift (user preferences change) | Low | Daily monitoring of AUC, alert if drops >5% |

Rollout Plan

Phase 0 (Week 1): Offline validation on holdout test set

  • Target: NDCG@5 ≥0.65, ranking AUC ≥0.92
  • Fairness check: AUC difference between demographic groups <2%

Phase 1 (Week 2): Shadow mode

  • Run both old and new systems, log recommendations from both
  • Compare: does new system have better ranking of clicked items?

Phase 2 (Week 3): Canary (5% of traffic)

  • Monitor engagement rate, latency, model drift
  • Success criteria: engagement ≥9%, latency <150ms

Phase 3 (Week 4-6): Gradual rollout (10% → 25% → 50%)

  • Daily monitoring dashboards
  • Ready to rollback if engagement drops or latency exceeds 200ms

Phase 4 (Week 7+): Full production

  • 100% traffic on new system
  • Daily retraining enabled
  • Weekly monitoring reviews

Fillable ML System Design Template

Copy this template and fill in your own system details:

```markdown
# ML System Design Document

**Title**: [Project Name]
**Author**: [Name]
**Date**: [YYYY-MM-DD]
**Status**: [Draft/Review/Approved]

## 1. Problem Statement

What is the business problem we're solving?

- Current state: [Describe current situation]
- Target state: [What success looks like]
- Why now: [What's changed?]
- Expected impact: [Business metrics]

## 2. Success Metrics

### ML Metrics

- [Metric 1]: Target ≥ [threshold]
- [Metric 2]: Target ≤ [threshold]
- Latency requirement: [p99 in ms]

### Business Metrics

- [KPI 1]: Target [X%] increase
- [KPI 2]: Target [X%] improvement

## 3. Data Requirements

- **Sources**: [Where does data come from?]
- **Volume**: [Number of examples, size]
- **Freshness**: [How often updated?]
- **Features**: [List key features]
- **Labels**: [How do we get ground truth?]
- **Quality Issues**: [Known limitations]

## 4. Model Architecture

- **Model Type**: [Type of model]
- **Input Shape**: [Dimensions, format]
- **Output**: [What does model predict?]
- **Key Components**: [Layers, attention mechanisms, etc.]

**Rationale**: [Why this architecture over alternatives?]

## 5. Infrastructure Architecture

- **Training**: [Schedule, infrastructure, code location]
- **Serving**: [API, batch, format]
- **Feature Store**: [Which system? How updated?]
- **Monitoring**: [What metrics? Which tools?]

[Include architecture diagram here]

## 6. Trade-Off Analysis

| Dimension | Option A | Option B | Chosen | Rationale |
|---|---|---|---|---|
| [e.g., Model] | [Option 1] | [Option 2] | [Choice] | [Reasoning] |

## 7. Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| [Risk Name] | High/Med/Low | High/Med/Low | [How to prevent/handle] |

## 8. Rollout Plan

- **Phase 0**: [Validation steps]
- **Phase 1**: [Shadow mode / initial test]
- **Phase 2**: [Canary rollout]
- **Phase 3**: [Gradual rollout]
- **Phase 4**: [Full production]

**Rollback Plan**: [How do we go back?]
```

Request Flow Diagram (Real-Time Inference)

Here's how a recommendation request flows through your system:

```mermaid
sequenceDiagram
    participant User as Frontend
    participant API as REST API
    participant Cache as Cache Layer
    participant FS as Feature Store
    participant Model as Model Server
    participant DB as Product DB

    User->>API: GET /recommendations?user_id=123
    API->>Cache: Check cached recommendations
    alt Cache Hit (fresh <1hr)
        Cache-->>API: Return cached recommendations
    else Cache Miss
        API->>FS: Fetch user features (embeddings)
        FS-->>API: User embedding
        API->>FS: Fetch product features (embeddings)
        FS-->>API: Product embeddings (batch)
        API->>Model: Predict scores (user_emb, prod_embs)
        Model-->>API: Recommendation scores
        API->>DB: Fetch product metadata (top 10)
        DB-->>API: Titles, images, prices
        API->>Cache: Store recommendations for 1 hour
        Cache-->>API: Stored
    end
    API-->>User: JSON [recs with titles, images, prices]
```

Notice the caching strategy? We don't recompute recommendations every time—that would kill our latency SLA. Cache for 1 hour, recompute if stale.
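
The cache-aside logic in that flow can be sketched with a simple TTL check; `compute_fn` is a hypothetical stand-in for the full feature-store-plus-model path:

```python
import time

CACHE_TTL_SECONDS = 3600  # recompute if older than 1 hour
_cache: dict[str, tuple[float, list]] = {}

def get_recommendations(user_id: str, compute_fn) -> list:
    """Return cached recommendations if fresh, else recompute and cache."""
    entry = _cache.get(user_id)
    if entry is not None:
        stored_at, recs = entry
        if time.monotonic() - stored_at < CACHE_TTL_SECONDS:
            return recs  # cache hit: no feature fetch, no model call
    recs = compute_fn(user_id)  # cache miss: feature store + model server path
    _cache[user_id] = (time.monotonic(), recs)
    return recs
```

In production the in-process dict would typically be Redis or a similar shared cache, but the hit/miss-with-TTL shape is the same.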

Monitoring and Incident Response

You ship your model. Congrats. Now what goes wrong?

Daily Monitoring Checklist:

  • Model inference latency (p50, p99)
  • Recommendation diversity (are we recommending the same products to everyone?)
  • User engagement on recommendations (clicks, conversions)
  • Model prediction distribution (has it changed from training?)
  • Feature freshness (are all features being updated?)

If Engagement Drops Suddenly:

  1. Check data pipeline—are features being computed?
  2. Check model serving—is latency spiking?
  3. Compare current predictions to yesterday's
  4. If model predictions are garbage, roll back to previous version
  5. Investigate root cause offline

If Latency Spikes:

  1. Check feature store query time
  2. Check model inference time
  3. If feature store is slow, revert to cached features
  4. If model is slow, switch to CPU/batch inference temporarily

If Model Performance Degrades (engagement drops, AUC drops):

  1. Run data quality checks—are labels corrupted?
  2. Check for distribution shift—has user behavior changed?
  3. Retrain on recent data
  4. If retraining doesn't help, investigate external factors (new competitor, seasonal shift, etc.)

The Incident Response Mindset

Incident response is where ML systems reveal their true character. Things will go wrong—not if, but when. The question isn't how to prevent all problems. It's how to detect them quickly and respond effectively. The teams that sleep well are the ones that have thought through their incident response process before the pager goes off at 3 AM.

The key insight is that your incident response playbook should be written as part of the design document, not discovered during a crisis. When you're debugging a production issue, you don't want to be figuring out simultaneously what to do and how to do it. You want a clear, battle-tested playbook that you can execute quickly. The design document is where you write that playbook.

Think about the different failure modes that could affect your system. Your model could make predictions that are garbage. Your feature pipeline could break. Your model serving infrastructure could crash. Your retraining job could fail silently. For each of these, you want to know: How will we detect it? What's our immediate response? What's our investigation process? When do we roll back vs. when do we push forward?

Another important aspect of incident response is understanding your system's failure modes deeply enough to prioritize what to monitor. You can't monitor everything, so you need to be strategic about what you watch. Monitor the metrics that, if they go wrong, would have the largest business impact. Monitor the technical metrics that are leading indicators of business impact. This is how you catch problems early before they cascade into customer-facing outages.

The teams we know that handle incidents best all share a common pattern: they have a clear escalation path, they have runbooks for common issues, and they regularly practice incident response. Some teams run monthly incident simulations where they deliberately break something and have their on-call engineer walk through the response. This sounds like overhead, but it's actually how you build the muscle memory to respond effectively under pressure.

ML-Specific Requirements Deep Dive

Let's zoom in on some ML-specific concerns that traditional system design docs don't cover but are critical for production systems.

Framing the Prediction Task

Here's where many teams stumble: they think they're building one thing but end up building another.

Are you building a ranking problem or a classification problem? These require different metrics and evaluation approaches.

  • Ranking: "Rank products by relevance to the user" → Optimize for NDCG, MRR (mean reciprocal rank), or novelty metrics
  • Classification: "Is this email spam or not?" → Optimize for precision, recall, F1-score
  • Regression: "Predict customer lifetime value" → Optimize for MSE, MAE, or calibration

Your problem framing determines your entire ML pipeline. Get this wrong and you'll optimize for the wrong metrics.

Example: An e-commerce team thought they were building a "classification" system (recommend or don't recommend). But their actual problem was "ranking" (show top 5 products by relevance). They optimized for accuracy (binary classification metric) but should have optimized for NDCG@5 (ranking metric). Result: their top-1 recommendation was great, but positions 2-5 were mediocre.
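
The gap between accuracy and a ranking metric is easy to see with a small NDCG implementation. A minimal sketch (the relevance grades are illustrative):

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for one ranked list; `relevances` are graded relevance scores
    in the order the system ranked the items."""
    def dcg(rels):
        # Gains are discounted by log2 of position (1-indexed, so i+2)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Great top-1 but weak positions 2-5, vs. a uniformly strong ranking
print(ndcg_at_k([3, 0, 0, 1, 0]))
print(ndcg_at_k([3, 2, 2, 1, 1]))
```

A binary accuracy metric can look identical for both lists; NDCG@5 separates them, which is exactly the failure mode the team above hit.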

Precision vs. Recall Trade-Offs in Context

This is THE fundamental ML trade-off, and it's entirely business-dependent.

A spam filter wants high precision (very few false positives). If you block a legitimate email from your boss, that's catastrophic.

A cancer screening model wants high recall (catch every case). Missing one cancer diagnosis is worse than a false positive that requires follow-up.

Document your business context:

  • If False Positive cost = $100 and False Negative cost = $10,000, optimize for recall
  • If False Positive cost = $10,000 and False Negative cost = $100, optimize for precision
  • If they're equal cost, aim for balanced (F1-score)

Here's the killer insight: Your precision/recall target should be negotiated with stakeholders BEFORE you build the model. If you optimize the wrong metric, you've wasted months of work.

Create a simple cost matrix in your design doc:

| Scenario | Cost/Impact | Action |
| --- | --- | --- |
| Recommend irrelevant product (false positive) | User wastes 1 min browsing, 3% chance of angry review | Acceptable |
| Fail to recommend relevant product (false negative) | Lost sale (~$30 revenue), user goes to competitor | Not acceptable |
| Conclusion | Optimize for recall >90%, accept precision as low as 60% | - |
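Once you have FP/FN dollar costs, the decision threshold stops being a gut call: you can grid-search the threshold that minimizes expected cost. A minimal sketch (function names and toy labels/scores are ours):

```python
def expected_cost(y_true, scores, threshold, fp_cost, fn_cost):
    """Total dollar cost of the errors a given decision threshold produces."""
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return fp * fp_cost + fn * fn_cost

def best_threshold(y_true, scores, fp_cost, fn_cost):
    """Grid-search the threshold that minimizes expected cost."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: expected_cost(y_true, scores, t, fp_cost, fn_cost))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.1]

# False negatives 100x more expensive -> a low threshold (favor recall):
print(best_threshold(labels, scores, fp_cost=100, fn_cost=10_000))
# Flip the costs -> a high threshold (favor precision):
print(best_threshold(labels, scores, fp_cost=10_000, fn_cost=100))
```

The point is not the grid search itself; it's that the threshold falls out of the cost matrix you negotiated with stakeholders, not out of a default 0.5.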

Training Data Freshness and Concept Drift

Your model learns from historical data. But users change. Preferences shift. Adversaries adapt. The data distribution today is different from yesterday.

This is concept drift, and it's how good models become bad models over time.

You need to decide:

  • How fresh does training data need to be? (Real-time stream? Daily batch? Weekly?)
  • How often do you retrain? (Continuously? Daily? Weekly? On-demand?)
  • How do you detect when performance degrades? (Daily monitoring? Weekly reviews? Automated alerts?)

Most teams say "we'll monitor daily" but don't actually set up the infrastructure. Then their model degrades silently for 3 months until someone notices engagement dropped.

Here's the right approach:

  1. Set up automated data quality checks (null rate, distribution shift, class imbalance)
  2. Monitor model performance metrics daily (AUC, recall, latency)
  3. Alert if performance drops >5% or latency spikes >50%
  4. Trigger automated retraining if performance drops >10%
  5. Have a manual incident process for bigger problems
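The drift-detection and alerting steps above can be sketched in a few lines. This is a simplified illustration (function names, the PSI rule-of-thumb cutoff of 0.25, and the 5%/10% drop thresholds mirror the steps above; everything else is our assumption):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and today's sample."""
    lo, hi = min(expected), max(expected)
    def bucket_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = 0 if hi == lo else int((x - lo) / (hi - lo) * bins)
            counts[max(0, min(i, bins - 1))] += 1
        # Smooth so empty buckets don't blow up the log term.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]
    return sum((a - e) * math.log(a / e)
               for e, a in zip(bucket_fractions(expected), bucket_fractions(actual)))

def retrain_or_alert(baseline_auc, current_auc):
    """Steps 3-4 above: alert on a >5% relative drop, retrain on >10%."""
    drop = (baseline_auc - current_auc) / baseline_auc
    if drop > 0.10:
        return "retrain"
    if drop > 0.05:
        return "alert"
    return "ok"

baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(baseline, baseline))  # ~0: identical distributions
print(psi(baseline, shifted))   # large: the feature has drifted upward
```

A common rule of thumb treats PSI above 0.25 as a major shift worth investigating; the exact cutoff belongs in your design doc.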

Acceptable Degradation Gradient

Your model won't be perfect forever. It will degrade. The question is: how much degradation is acceptable before you act?

Define your degradation gradient:

| Metric | Baseline | Yellow (alert) | Red (rollback) |
| --- | --- | --- | --- |
| Engagement rate | 12% | Drops to 11.5% | Drops to 11% |
| Inference latency p99 | 120ms | >180ms | >250ms |
| Model AUC | 0.92 | Drops to 0.89 | Drops to 0.85 |
| Error rate | 0.5% | >0.8% | >1.2% |

When you hit yellow, investigate. When you hit red, rollback.

This prevents the "it's degrading slowly so we'll just live with it" situation, where quality erodes until you can't recover.
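The gradient is trivial to encode, which is exactly why it should live in code next to your alerting, not just in a doc. A sketch using the thresholds from the table above (metric keys and function name are ours):

```python
# Thresholds copied from the degradation-gradient table above.
THRESHOLDS = {
    "engagement_rate": {"yellow": 0.115, "red": 0.11, "higher_is_better": True},
    "latency_p99_ms":  {"yellow": 180,   "red": 250,  "higher_is_better": False},
    "model_auc":       {"yellow": 0.89,  "red": 0.85, "higher_is_better": True},
    "error_rate":      {"yellow": 0.008, "red": 0.012, "higher_is_better": False},
}

def status(metric, value):
    """'green' = healthy, 'yellow' = investigate, 'red' = roll back."""
    t = THRESHOLDS[metric]
    breached_red = value <= t["red"] if t["higher_is_better"] else value >= t["red"]
    breached_yellow = value <= t["yellow"] if t["higher_is_better"] else value >= t["yellow"]
    if breached_red:
        return "red"
    if breached_yellow:
        return "yellow"
    return "green"

print(status("model_auc", 0.91))      # green
print(status("latency_p99_ms", 200))  # yellow
print(status("error_rate", 0.013))    # red
```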

Advanced Topic: Multi-Objective Optimization

Most production ML systems optimize for multiple objectives simultaneously. Recommendation systems optimize for engagement AND diversity AND serendipity AND shelf time of inventory. You can't maximize all of them.

This is where Pareto optimization comes in. You're looking for the "best trade-off" between competing objectives.

Document this explicitly:

Objective 1: Maximize engagement (clicks)
Objective 2: Maximize diversity (don't recommend the same products)
Objective 3: Maximize profit margin (recommend high-margin items)

Weights:
- Engagement: 60% (primary business goal)
- Diversity: 25% (user satisfaction)
- Margin: 15% (profitability)

Combined score = 0.6 * engagement_score + 0.25 * diversity_score + 0.15 * margin_score

Then optimize your model to maximize that combined score.
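The weighted scoring above is a one-liner, but writing it down surfaces the interesting behavior: an item can win the ranking without winning any single objective. A sketch with the weights from the breakdown above (candidate names and scores are made up):

```python
# Weights from the objective breakdown above.
WEIGHTS = {"engagement": 0.60, "diversity": 0.25, "margin": 0.15}

def combined_score(scores):
    """Weighted sum of per-objective scores, each normalized to [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[obj] * scores[obj] for obj in WEIGHTS)

candidates = {
    "item_a": {"engagement": 0.80, "diversity": 0.20, "margin": 0.50},
    "item_b": {"engagement": 0.60, "diversity": 0.90, "margin": 0.40},
}
ranked = sorted(candidates, key=lambda c: combined_score(candidates[c]), reverse=True)
print(ranked)  # item_b wins despite lower engagement, thanks to diversity
```

If stakeholders disagree with that outcome, the fix is a conversation about the weights, not a quiet tweak to the model.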

Measuring Model Fairness

Here's a hard truth: your model is biased. Every model is. The question is whether the bias is acceptable to your business and users.

Common fairness metrics:

  • Disparate impact: Does the model treat different demographic groups differently? (e.g., are loan approvals different for different races?)
  • Calibration: If the model says 90% of X users will convert, do ~90% actually convert?
  • Individual fairness: Are similar users treated similarly?

You need to decide: what level of fairness is acceptable?

  • For a recommendation system: AUC difference between demographic groups <2%
  • For lending: approval rate difference between groups <3%
  • For hiring: false negative rate difference between groups <5%

These thresholds should be in your design doc, not discovered months after deployment.
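Checking a gap budget like the ones above is mechanically simple; the hard part is agreeing on the numbers. A minimal sketch (function names and the per-group AUC values are ours, for illustration):

```python
def metric_gap(metric_by_group):
    """Absolute spread of a metric (e.g., AUC) across demographic groups."""
    vals = metric_by_group.values()
    return max(vals) - min(vals)

def within_fairness_budget(metric_by_group, max_gap):
    """True if the worst-vs-best group gap stays inside the agreed budget."""
    return metric_gap(metric_by_group) <= max_gap

auc_by_group = {"group_a": 0.92, "group_b": 0.93, "group_c": 0.905}
print(metric_gap(auc_by_group))                    # ≈0.025
print(within_fairness_budget(auc_by_group, 0.02))  # False: breaches the <2% budget
```

Run this as part of evaluation on every candidate model, per protected attribute, so a fairness regression blocks a release the same way an AUC regression does.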

Key Takeaways

An ML system design document is not busywork. It's how you:

  • Think clearly about your architecture before you code
  • Communicate with stakeholders about trade-offs
  • Plan for failure before your system fails
  • Scale confidently because you've thought through the details

The best ML engineers spend 30% of their time on the design document and 70% on implementation. The worst skip the document and spend 70% fighting production fires.

So next time you're tempted to jump straight to coding, pause. Grab the template above. Fill it out with your team. Then build.

Your future self in production will thank you.

The design document also becomes your primary communication tool with stakeholders. Non-technical executives don't care about your architecture—they care about business metrics. The design document translates your technical decisions into business outcomes. When you explain why you chose XGBoost over deep learning, you're ultimately explaining that it delivers better latency at acceptable accuracy, which means faster response times for customers. When you explain your monitoring strategy, you're explaining how you'll catch problems before they affect users. This translation between technical and business language is often the difference between getting buy-in and being told to figure it out later.

There's also a subtle but important aspect of design documents: they're political tools. When you document your reasoning upfront, you create a record that protects you later. If someone asks "why did we make this choice?" months later when circumstances have changed, you have an answer. You can point to the decision-making process you followed and the assumptions you made. If those assumptions no longer hold, you can have an informed conversation about what to change. This prevents the frustrating situation where you're defending past decisions without context.

Finally, the discipline of writing a design document forces clarity. If you can't explain your choice in writing, you probably don't understand it well enough. If you can't articulate the risks, you probably haven't thought about them. If you can't define success metrics, you won't be able to measure whether you succeeded. The document forces these questions before you start building. This is uncomfortable in the moment but invaluable later.

What Comes Next

Once your design doc is approved:

  1. Set up monitoring infrastructure first before training any models
  2. Write the rollback plan in code (literally, make a script)
  3. Build training infrastructure that can retrain easily and often
  4. Instrument everything (logs, metrics, traces)
  5. Run incident simulations before you have a real incident
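Step 2 deserves emphasis: "rollback plan in code" can be as small as a script that repoints serving at the previous model version. A sketch, assuming a hypothetical JSON registry file of the shape shown in the docstring (the registry format, file, and function name are all ours):

```python
import json
import pathlib

def rollback(registry_path):
    """Repoint serving at the previous model version.

    Assumes a hypothetical registry file shaped like:
        {"current": "v12", "history": ["v10", "v11", "v12"]}
    where "history" lists versions oldest-first.
    """
    path = pathlib.Path(registry_path)
    registry = json.loads(path.read_text())
    history = registry["history"]
    idx = history.index(registry["current"])
    if idx == 0:
        raise RuntimeError("no earlier version to roll back to")
    registry["current"] = history[idx - 1]
    path.write_text(json.dumps(registry, indent=2))
    return registry["current"]
```

However your real registry works, the test is the same: can the on-call engineer revert to the last good model with one command, at 3 a.m., without asking anyone?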

A well-designed system is one where you've thought about failure modes upfront and built the tools to handle them.

