
You've trained a killer machine learning model. It's got great metrics, passed validation, and you've serialized it with pickle, ONNX, or TorchScript. Now what?
It's sitting on your laptop. Nobody else can use it. Your team needs predictions at scale. Your customers need a real-time interface. That's where model serving comes in, and we're going to do it the production way with FastAPI.
This isn't just slapping your model into a Flask endpoint. We're talking structured schemas, async predictions, health checks for Kubernetes, structured logging, batch processing, and load testing. We're building something that ops teams don't want to shoot.
But before we write a single line of code, let's talk about what model serving actually means in practice, because the gap between a working Jupyter notebook and a production-grade inference service is wider than most data scientists expect.

A notebook is a laboratory. A serving API is a factory. The factory needs to run 24/7, handle unexpected inputs without crashing, report its own health, recover from failures, and scale horizontally when demand spikes. None of that is free, and none of it is automatic just because your model achieves 94% accuracy on the test set. What you're about to build isn't just a wrapper around model.predict(); it's the entire operational layer that transforms your model from a research artifact into a production asset that the business can actually rely on.

The good news: FastAPI makes the hard parts genuinely manageable, and the patterns we cover here apply whether you're serving a simple scikit-learn classifier or a multi-billion-parameter neural network. Let's build it the right way from the start.
Table of Contents
- Why FastAPI for Model Serving?
- Serving Architecture
- The Production FastAPI Model Server
- Lifespan Context Manager: Load Your Model Once
- Pydantic Schemas: Validation That Doesn't Suck
- Sync vs Async Prediction Endpoints
- Batch Prediction for Real Throughput
- Batching and Optimization
- Health Checks and Monitoring
- Structured Logging with Latencies
- Request Validation and Error Handling
- Load Testing with Locust
- Common Serving Mistakes
- Common Production Patterns
- Model Versioning
- Feature Engineering at Serving Time
- Caching Predictions
- Rate Limiting and Request Queuing
- Graceful Shutdown
- Performance Tuning Tips
- Monitoring in Production
- The Complete Picture
- Putting It Together: A Complete Example
- Key Takeaways
- Conclusion
Why FastAPI for Model Serving?
You've probably heard of FastAPI. It's the new-ish async web framework that's been stealing Flask's lunch. But why is it perfect for ML serving specifically?
It's fast. Its async-first design means your server doesn't block on I/O; requests queue up and are processed efficiently on a single event loop instead of a thread-per-request model.
Validation is built-in. With Pydantic, your request/response schemas are self-validating. Bad data bounces back automatically. Your model doesn't even see garbage.
Async support for heavy lifting. If your prediction is CPU-bound (which it usually is), you can offload to a thread pool. If it's I/O (database checks, feature engineering), pure async rocks.
OpenAPI docs out of the box. Clients can see exactly what your endpoints expect. No more Slack messages asking "what parameters does this take?"
Structured logging. With libraries like structlog, you can emit JSON logs with prediction latencies, input shapes, and error traces. Ops loves that.
Let's build something real.
Serving Architecture
Before you write your first endpoint, it's worth understanding the architecture you're building toward, because the decisions you make in the first 20 lines of code will constrain you for the lifetime of the service.
A production ML serving architecture has three distinct layers. The first is the interface layer: your FastAPI application, which handles HTTP concerns like routing, request parsing, response serialization, authentication, and rate limiting. This layer knows nothing about machine learning; it only knows about HTTP. The second is the inference layer: the code that takes validated, normalized inputs and feeds them to your model. This layer knows about your model's expected input format, handles batch construction, manages the model lifecycle, and returns raw predictions. The third is the observability layer: structured logging, metrics emission, health check reporting, and alerting. This layer watches everything that happens and makes it visible to humans and automated systems.
Keeping these layers conceptually separate matters enormously in practice. When you mix HTTP routing logic with feature engineering with logging calls in a single endpoint function, the result is untestable, unmaintainable, and nearly impossible to debug under load. The code we're building in this article respects these boundaries: FastAPI handles the HTTP layer, Pydantic models define the interface contracts, a dedicated ModelState class manages the inference layer, and structlog handles the observability layer. This isn't over-engineering; it's the minimum viable structure for a service you actually want to run in production. A 200-line FastAPI file organized this way is worth more than a 2000-line monolith that technically works.
The Production FastAPI Model Server
Here's the skeleton of what we're building:
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse
from contextlib import asynccontextmanager
import pickle
import numpy as np
from pydantic import BaseModel, Field
import logging
import time
import structlog
from datetime import datetime
# Configure structured logging
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.stdlib.PositionalArgumentsFormatter(),
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.UnicodeDecoder(),
structlog.processors.JSONRenderer()
],
context_class=dict,
logger_factory=structlog.stdlib.LoggerFactory(),
cache_logger_on_first_use=True,
)
logger = structlog.get_logger()
# Define request/response schemas
class PredictionRequest(BaseModel):
features: list[float] = Field(..., description="Input feature vector")
class Config:
json_schema_extra = {
"example": {
"features": [0.5, 1.2, 3.4, 2.1]
}
}
class PredictionResponse(BaseModel):
prediction: float
confidence: float
latency_ms: float
timestamp: str
class HealthResponse(BaseModel):
status: str
model_loaded: bool
timestamp: str
# Global state for model
class ModelState:
def __init__(self):
self.model = None
self.model_loaded = False
model_state = ModelState()
# Lifespan context manager for startup/shutdown
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup
logger.info("model_loading", event="startup")
try:
with open("./models/model.pkl", "rb") as f:
model_state.model = pickle.load(f)
model_state.model_loaded = True
logger.info("model_loaded", event="startup_complete", status="success")
except Exception as e:
logger.error("model_load_failed", event="startup", error=str(e))
model_state.model_loaded = False
yield
# Shutdown
logger.info("server_shutdown", event="shutdown")
model_state.model = None
model_state.model_loaded = False
app = FastAPI(
title="Model Serving API",
description="Production-grade ML model serving with structured logging",
version="1.0.0",
lifespan=lifespan
)
@app.get("/health", response_model=HealthResponse)
async def health_check():
"""Kubernetes liveness probe endpoint."""
return HealthResponse(
status="healthy" if model_state.model_loaded else "unhealthy",
model_loaded=model_state.model_loaded,
timestamp=datetime.utcnow().isoformat()
)
@app.get("/ready", response_model=HealthResponse)
async def readiness_check():
"""Kubernetes readiness probe endpoint."""
if not model_state.model_loaded:
raise HTTPException(status_code=503, detail="Model not loaded")
return HealthResponse(
status="ready",
model_loaded=True,
timestamp=datetime.utcnow().isoformat()
)
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
"""Single synchronous prediction endpoint."""
start_time = time.perf_counter()
if not model_state.model_loaded:
logger.error("prediction_failed", reason="model_not_loaded")
raise HTTPException(status_code=503, detail="Model not ready")
try:
# Validate input
if len(request.features) != 4:
logger.warning("prediction_invalid_shape", shape=len(request.features))
raise HTTPException(status_code=400, detail="Expected 4 features")
# Make prediction
features_array = np.array(request.features).reshape(1, -1)
prediction = float(model_state.model.predict(features_array)[0])
confidence = float(model_state.model.predict_proba(features_array).max())
latency_ms = (time.perf_counter() - start_time) * 1000
logger.info(
"prediction_success",
input_shape=features_array.shape,
prediction=prediction,
confidence=confidence,
latency_ms=latency_ms
)
return PredictionResponse(
prediction=prediction,
confidence=confidence,
latency_ms=latency_ms,
timestamp=datetime.utcnow().isoformat()
)
    except HTTPException:
        # Re-raise client errors (like the 400 above) so the generic
        # handler below doesn't convert them into 500s
        raise
    except Exception as e:
latency_ms = (time.perf_counter() - start_time) * 1000
logger.error(
"prediction_error",
error=str(e),
latency_ms=latency_ms,
exc_info=True
)
raise HTTPException(status_code=500, detail="Prediction failed")
if __name__ == "__main__":
import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Let's break down what's happening here. The skeleton above packs a surprising amount of production thinking into roughly 100 lines: the lifespan context manager handles model loading and teardown without any request-time overhead, the Pydantic models enforce input/output contracts at the framework level, and structlog is configured at startup to emit clean JSON rather than human-readable strings that are impossible to parse at scale. Notice that we're not using global variables carelessly: we have a dedicated ModelState class to hold the model, which makes it trivially easy to add model versioning or hot-reloading later without touching the endpoint code.
Lifespan Context Manager: Load Your Model Once
The lifespan pattern, built on contextlib's @asynccontextmanager and supported since FastAPI 0.93, replaces the old @app.on_event("startup") hooks. It's cleaner and more explicit:
@asynccontextmanager
async def lifespan(app: FastAPI):
# Code here runs at startup
logger.info("model_loading", event="startup")
with open("./models/model.pkl", "rb") as f:
model_state.model = pickle.load(f)
yield # App runs here
# Code here runs at shutdown
logger.info("server_shutdown", event="shutdown")
    model_state.model = None

Why this matters: Your model loads once when the server starts, not on every request. Loading a 500MB neural net on each prediction is a death sentence for latency. This ensures it's in memory, ready to go.
The yield is the magic. Everything before it is startup. Everything after is shutdown. Simple.
The lifespan pattern also lets you do other one-time initialization alongside model loading: pre-computing lookup tables, warming up database connection pools, loading feature scalers or tokenizers, or even running a sanity-check prediction against known inputs to verify the model loaded correctly before the server starts accepting traffic. This is the right place for all of that setup work, not inside your endpoint functions where it would run on every request. If anything fails during startup (the model file is missing, the pickle is corrupted, the GPU isn't available), the lifespan context manager is where you catch it and log it clearly, so the service fails fast with a meaningful error rather than silently serving garbage.
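That sanity-check idea is worth making concrete. Below is a minimal sketch of a warm-up helper you could call from the lifespan handler before flipping model_loaded to True (the warm_up name and the 4-feature dummy shape are assumptions for illustration, not part of the skeleton above):

```python
import numpy as np

def warm_up(model, n_features: int = 4) -> bool:
    """Run one throwaway prediction to pay JIT/kernel warm-up costs
    and verify the model actually works before accepting traffic."""
    dummy = np.zeros((1, n_features))  # shape must match your real inputs
    pred = model.predict(dummy)
    # A finite prediction is a cheap sanity signal that the model loaded intact
    return bool(np.isfinite(np.asarray(pred, dtype=float)).all())
```

Call it right after pickle.load() and only set model_loaded = True if it returns True; otherwise log the failure and leave the readiness probe failing so the orchestrator never routes traffic to a broken pod.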
Pydantic Schemas: Validation That Doesn't Suck
Your API isn't just a black box. Clients need to know what to send and what to expect back.
class PredictionRequest(BaseModel):
    features: list[float] = Field(..., description="Input feature vector")

Pydantic does three things here:
- Type enforcement: If someone sends {"features": ["a", "b", "c"]}, it bounces with a 422 error. No guessing.
- Documentation: FastAPI auto-generates OpenAPI docs. Clients see exactly what they need to send.
- Serialization: Python types (lists, dicts) become JSON automatically.
The Field(...) syntax lets you add descriptions, examples, and constraints:
class BatchPredictionRequest(BaseModel):
    features: list[list[float]] = Field(
        ...,
        min_length=1,
        max_length=1000,
        description="Batch of feature vectors (max 1000)"
    )

Now your API rejects requests with >1000 samples before they even hit your model. This is one of those features that pays dividends you won't appreciate until 3am, when an errant client sends a 50,000-sample batch and your server stays up instead of running out of memory. Pydantic validation runs before your endpoint function body executes, which means your model code never sees malformed input; the framework handles the rejection and formats the error response automatically. Beyond the safety benefits, well-defined Pydantic models are self-documenting: navigate to /docs on your running server and you'll see an interactive Swagger UI showing exactly what every endpoint expects, with example values, field descriptions, and type information that FastAPI generated entirely from your schema definitions.
Sync vs Async Prediction Endpoints
Here's the question: Should your prediction endpoint be async or sync?
If your prediction is CPU-bound (running inference on a large model), the async keyword alone won't help: an async def endpoint that does blocking computation stalls the event loop, and no other request makes progress until the prediction finishes. FastAPI has a useful quirk here: declare the endpoint with plain def and the framework automatically runs it in a threadpool, keeping the event loop free. For quick single-sample predictions either form works; for heavy compute, prefer plain def or offload explicitly to an executor (shown below).
@app.post("/predict")
async def predict(request: PredictionRequest):
# This is still CPU-bound, but async helps with queuing
features = np.array(request.features).reshape(1, -1)
prediction = float(model_state.model.predict(features)[0])
    return PredictionResponse(prediction=prediction, confidence=0.0, latency_ms=0.0, timestamp="")

If you need to do I/O (database lookups, external API calls), go async:
@app.post("/predict-with-context")
async def predict_with_context(request: PredictionRequest):
# I/O operation - truly async
user_context = await fetch_from_db(request.user_id)
# Combine features with context
augmented_features = combine(request.features, user_context)
# Prediction is still CPU-bound, but the DB call was async
prediction = model_state.model.predict(augmented_features)
return PredictionResponse(...)For CPU-bound work, if you really need async behavior, offload to a thread pool:
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

@app.post("/predict-threaded")
async def predict_threaded(request: PredictionRequest):
    loop = asyncio.get_running_loop()
# Offload model.predict to a thread
prediction = await loop.run_in_executor(
executor,
model_state.model.predict,
np.array(request.features).reshape(1, -1)
)
return PredictionResponse(...)But honestly? For single-sample predictions, a regular async endpoint is fine. The magic is in the next section.
Batch Prediction for Real Throughput
Single-sample predictions are useful for real-time APIs. But if you want throughput, batch 'em up. The fundamental reason batching works is that most ML frameworks (scikit-learn, PyTorch, TensorFlow) are built around vectorized matrix operations whose cost scales sub-linearly with batch size. The overhead of setting up a computation (loading weights into cache, initializing GPU kernels, allocating memory) is a fixed cost you pay regardless of batch size, so spreading that cost across 100 samples instead of 1 dramatically improves amortized throughput.
class BatchPredictionRequest(BaseModel):
    batch: list[list[float]] = Field(..., min_length=1, max_length=1000)
class BatchPredictionResponse(BaseModel):
predictions: list[float]
confidences: list[float]
latency_ms: float
timestamp: str
@app.post("/predict-batch", response_model=BatchPredictionResponse)
async def predict_batch(request: BatchPredictionRequest):
"""Batch predictions (up to 1000 samples)."""
start_time = time.perf_counter()
if not model_state.model_loaded:
raise HTTPException(status_code=503, detail="Model not ready")
try:
features_array = np.array(request.batch)
predictions = model_state.model.predict(features_array).tolist()
confidences = model_state.model.predict_proba(features_array).max(axis=1).tolist()
latency_ms = (time.perf_counter() - start_time) * 1000
logger.info(
"batch_prediction_success",
batch_size=len(request.batch),
latency_ms=latency_ms,
avg_latency_per_sample_ms=latency_ms / len(request.batch)
)
return BatchPredictionResponse(
predictions=predictions,
confidences=confidences,
latency_ms=latency_ms,
timestamp=datetime.utcnow().isoformat()
)
except Exception as e:
logger.error("batch_prediction_error", error=str(e), exc_info=True)
        raise HTTPException(status_code=500, detail="Batch prediction failed")

Why this is faster: Your model loves vectorized operations. Processing 100 samples at once is way faster than processing them one-by-one. The per-call overhead (Python dispatch, memory allocation, computation setup) is amortized across 100 predictions.
Benchmark: A single prediction might take 50ms. But 100 predictions in a batch might take 150ms total; that's 1.5ms per prediction. Huge difference. Notice we're also logging avg_latency_per_sample_ms in the batch endpoint; this is a metric you'll want to track over time because it tells you whether your batch efficiency is degrading as model complexity increases or as you scale to larger feature vectors. If your per-sample latency in batch mode starts approaching your single-sample latency, that's a signal your bottleneck has shifted from setup overhead to actual computation, and it's time to look at hardware upgrades, quantization, or a more efficient inference runtime.
Batching and Optimization
The batch endpoint you just built is a good start, but production batch serving has additional dimensions worth understanding. The first is dynamic batching: rather than requiring clients to assemble their own batches, your server accumulates individual requests over a short window (say, 5-10ms) and automatically groups them into a single model call. This is what serving frameworks like Triton Inference Server and TorchServe implement under the hood, and it's particularly valuable for real-time APIs where clients send individual samples but the server can still take advantage of batching internally.
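To make the accumulate-and-flush idea concrete, here's a toy pure-asyncio sketch of dynamic batching. This is illustrative, not how Triton or TorchServe actually implement it; the DynamicBatcher name and the window/batch-size defaults are invented, and it assumes a vectorized .predict() like the scikit-learn examples above:

```python
import asyncio
import numpy as np

class DynamicBatcher:
    """Accumulate single requests for a few ms, then run one batched predict."""

    def __init__(self, model, max_batch: int = 64, window_ms: float = 5.0):
        self.model = model
        self.max_batch = max_batch
        self.window = window_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features: list[float]) -> float:
        # Each caller parks on a future; the worker resolves it
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut

    async def worker(self) -> None:
        while True:
            first = await self.queue.get()  # block until a request arrives
            batch = [first]
            deadline = asyncio.get_running_loop().time() + self.window
            # Keep accumulating until the window closes or the batch is full
            while len(batch) < self.max_batch:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            feats = np.array([features for features, _ in batch])
            preds = self.model.predict(feats)  # one vectorized call
            for (_, fut), pred in zip(batch, preds):
                fut.set_result(float(pred))
```

In a FastAPI service you'd start batcher.worker() with asyncio.create_task() in the lifespan handler, and the /predict endpoint would simply await batcher.predict(request.features); clients keep sending single samples while the server batches internally.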
The second optimization dimension is input normalization. You almost certainly trained your model on normalized data, and the normalization parameters (mean, standard deviation, min, max) live in a separate artifact, usually a scikit-learn StandardScaler or a custom preprocessing pipeline. Load that artifact in your lifespan context manager alongside the model itself, and apply it in the endpoint before calling model.predict(). Never ask clients to normalize their own inputs; that pushes domain knowledge about your training data pipeline into every client, making it impossible to update the normalization logic without breaking all of them simultaneously.
The third dimension is output post-processing. Raw model outputs are often not what clients need. A classification model returns a probability vector; clients usually want the class label and the top-1 confidence score. A regression model returns a float; clients might need it rounded to two decimal places and clipped to a valid range. Do this post-processing in your endpoint layer, not in the client. That way, if your model changes and the output range shifts, you fix it in one place. These three optimizations (dynamic batching, server-side normalization, and server-side post-processing) together can reduce median latency by 40-60% compared to a naive endpoint that does none of them.
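As a concrete illustration of server-side post-processing, here's a minimal sketch that maps a raw probability vector to what a client typically wants (the label set and rounding rule are hypothetical placeholders for your own model's output contract):

```python
import numpy as np

LABELS = ["setosa", "versicolor", "virginica"]  # hypothetical label set

def postprocess(proba_row: np.ndarray) -> dict:
    # Convert a raw probability vector into the shape clients actually
    # want: a human-readable label plus a rounded top-1 confidence
    idx = int(np.argmax(proba_row))
    return {"label": LABELS[idx], "confidence": round(float(proba_row[idx]), 2)}
```

If the model is later retrained with a different class ordering, only this mapping changes; every client keeps receiving the same response shape.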
Health Checks and Monitoring
Health checks are not optional if you're deploying to any orchestration platform. Kubernetes, ECS, Nomad, they all need to know whether your service is alive and whether it can handle traffic, and they need that information on a short polling interval (typically 5-30 seconds). If you don't provide health endpoints, the orchestrator has no way to distinguish between a pod that's running fine and a pod that loaded but immediately crashed trying to deserialize the model. Your service will get traffic routed to it regardless, and you'll see mysterious 500s in production that are extremely difficult to debug after the fact.
@app.get("/health")
async def health_check():
"""Liveness probe - is the process running?"""
return HealthResponse(
status="healthy" if model_state.model_loaded else "unhealthy",
model_loaded=model_state.model_loaded,
timestamp=datetime.utcnow().isoformat()
)
@app.get("/ready")
async def readiness_check():
"""Readiness probe - can you serve traffic?"""
if not model_state.model_loaded:
raise HTTPException(status_code=503, detail="Model not loaded")
    return HealthResponse(
        status="ready",
        model_loaded=True,
        timestamp=datetime.utcnow().isoformat()
    )

In your K8s manifest:
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 10
  periodSeconds: 5

The liveness probe checks whether the process is responsive at all; note that /health returns 200 even when the model isn't loaded (it just reports "unhealthy" in the body), so a missing model file won't put the pod into a restart loop. The readiness probe is stricter: it returns 503 until the model is loaded, so the pod receives no traffic until it can actually serve. K8s will restart the pod if liveness fails, and remove it from the load balancer if readiness fails.
Beyond basic liveness and readiness, consider adding a /metrics endpoint that exposes Prometheus-compatible metrics: request count, error rate, latency percentiles, and model-specific metrics like average confidence score and the distribution of predicted classes. Prometheus can scrape this endpoint every 15 seconds and Grafana can build dashboards that show you at a glance whether your model is degrading, either because of infrastructure problems (latency spiking) or because of data drift (confidence scores trending downward as the incoming data distribution shifts away from your training distribution). That second signal, model confidence drift, is something no amount of infrastructure monitoring will catch for you; you have to instrument it yourself in your application layer.
Structured Logging with Latencies
Generic logs are useless. You need to log latency, model performance, and errors in a way that's parseable and queryable.
import structlog
logger = structlog.get_logger()
# In your prediction endpoint
start_time = time.perf_counter()
prediction = model_state.model.predict(features_array)
latency_ms = (time.perf_counter() - start_time) * 1000
logger.info(
"prediction_success",
input_shape=features_array.shape,
prediction=prediction,
confidence=confidence,
latency_ms=latency_ms
)

This produces JSON:
{
"event": "prediction_success",
"input_shape": [1, 4],
"prediction": 0.87,
"confidence": 0.92,
"latency_ms": 12.5,
"timestamp": "2026-02-25T10:30:45.123456"
}

Now you can:
- Query latencies: jq -s 'map(.latency_ms // empty) | add/length' logs.jsonl gives you the mean
- Track model drift: jq 'select(.confidence < 0.5)' logs.jsonl
- Find errors: jq 'select(.level == "error")' logs.jsonl
This is the difference between "the model is slow" and "median latency is 45ms, p95 is 200ms, and it spikes when batch size > 50." Structured logging also makes it straightforward to pipe your logs into centralized observability platforms like Datadog, Elasticsearch, or CloudWatch, because every log line is already valid JSON with consistent field names; you don't need brittle regex parsers or custom log format configurations. The investment you make in defining good log schemas upfront pays compound interest for the lifetime of the service.
Request Validation and Error Handling
Your model is fragile. It expects exactly 4 features, normalized to [0, 1], with no NaNs. Protect it.
@app.post("/predict")
async def predict(request: PredictionRequest):
# Pydantic validates types automatically
if len(request.features) != 4:
raise HTTPException(
status_code=400,
detail="Expected 4 features, got {}".format(len(request.features))
)
# Check for NaNs or Infs
features_array = np.array(request.features)
if not np.isfinite(features_array).all():
raise HTTPException(
status_code=400,
detail="Features contain NaN or Inf"
)
# Check range
if (features_array < 0).any() or (features_array > 1).any():
logger.warning("prediction_out_of_range", features=request.features)
# You might log but still serve, depending on your tolerance
# Safe to predict
try:
prediction = model_state.model.predict(features_array.reshape(1, -1))
except Exception as e:
logger.error("model_prediction_error", error=str(e), exc_info=True)
        raise HTTPException(status_code=500, detail="Prediction failed")

The key: fail fast with clear error messages. A 400 (bad request) is the client's fault. A 500 (server error) is yours. The out-of-range check deserves special attention: you might be tempted to silently clamp features to [0, 1] rather than warning or rejecting. Resist that temptation. Silently transforming inputs means your model is making predictions on data that doesn't match what the client sent, and if those predictions are used downstream in a business decision, you've introduced a subtle correctness bug that's almost impossible to detect later. Log the warning, let the prediction proceed, but make sure the out-of-range condition is visible in your logs so you can investigate whether clients are sending pre-normalized or raw data and fix the upstream pipeline accordingly.
Load Testing with Locust
Theory is nice. Reality is brutal. You need to know: How many requests per second can you handle? When does latency blow up?
Locust is a Python load testing tool. It's dead simple:
# locustfile.py
from locust import HttpUser, task, between
import random
class PredictionUser(HttpUser):
wait_time = between(1, 3) # Wait 1-3 seconds between requests
@task(3)
def single_prediction(self):
"""Single prediction endpoint - 3x weight."""
features = [random.random() for _ in range(4)]
self.client.post(
"/predict",
json={"features": features}
)
@task(1)
def batch_prediction(self):
"""Batch prediction endpoint - 1x weight."""
batch = [[random.random() for _ in range(4)] for _ in range(100)]
self.client.post(
"/predict-batch",
json={"batch": batch}
)
def on_start(self):
"""Check health on startup."""
self.client.get("/health")Run it:
locust -f locustfile.py --host=http://localhost:8000 --users=100 --spawn-rate=10

This simulates 100 concurrent users, ramping up at 10 per second. Locust gives you a web UI showing:
- Requests/sec: How many you're handling
- Response time: Mean, min, max, p50, p95, p99
- Failures: Which endpoints are breaking
- Throughput under load: How it degrades
You'll see reality: "My model does 100 predictions/sec solo, but at 50 concurrent users and 200 req/sec, latency p95 hits 2 seconds."
Then you optimize: add more workers, batch predictions, upgrade to a GPU, cache results, etc. The numbers Locust gives you are not just useful for optimization; they're the foundation of your capacity planning conversations with engineering leadership. "We can handle 500 req/sec at p95 < 100ms with 4 Uvicorn workers on a c5.2xlarge" is a concrete, defensible statement that you can use to justify infrastructure spend or to set SLA commitments with the product team. Run load tests before you deploy, and run them again after every significant model update, because a new model version can have dramatically different inference latency characteristics than the previous one.
Common Serving Mistakes
Every engineer who's taken a model to production has made at least one of these mistakes. We're listing them here so you don't have to learn them the hard way at 2am during an on-call incident.
The most common mistake is loading the model on every request. We covered this with the lifespan context manager, but it bears repeating: if you see pickle.load() inside an endpoint function body, that's a critical performance bug. A 200MB sklearn model takes 300-800ms to deserialize from disk. That's your entire latency budget gone before you've done a single computation. Load once at startup, keep in memory, serve forever.
The second common mistake is ignoring model warm-up. When a process first calls model.predict(), frameworks like PyTorch JIT and TensorFlow XLA spend time compiling computation graphs and warming up GPU kernels. The first prediction after startup can be 10-100x slower than subsequent ones. This means that if Kubernetes sends traffic to a new pod immediately after the readiness probe passes, those first few requests will see anomalous latency spikes. Fix it by running a dummy prediction in your lifespan startup code before setting model_loaded = True.
The third mistake is not handling the 503 case gracefully. If your model fails to load at startup, you need to return HTTP 503 from every prediction endpoint, not HTTP 500 and not a traceback. Downstream clients should treat 503 as "retry later" and 500 as "my request was broken." A failed model load is not the client's fault, and conflating it with internal prediction errors will make incident diagnosis much harder.
The fourth mistake is over-trusting Pydantic validation. Pydantic catches type errors but it doesn't catch semantic errors: a feature vector of 4 floats that are all 0.0 is valid JSON and will pass Pydantic validation, but it might represent a completely invalid input to your specific model. Add domain-specific validation checks (NaN detection, range checks, shape verification) manually in your endpoint code after Pydantic has done its job. Think of Pydantic as the syntax checker and your manual validation as the semantic checker.
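One way to narrow the gap between syntax and semantics is to push some of those domain checks into the schema itself. A sketch using Pydantic v2's field_validator (the StrictPredictionRequest name is invented for illustration):

```python
import math
from pydantic import BaseModel, Field, field_validator

class StrictPredictionRequest(BaseModel):
    # min_length/max_length enforce the shape; the validator below
    # enforces semantics Pydantic can't infer from types alone
    features: list[float] = Field(..., min_length=4, max_length=4)

    @field_validator("features")
    @classmethod
    def finite_only(cls, v: list[float]) -> list[float]:
        # Reject NaN/Inf at validation time, before the endpoint body runs
        if not all(math.isfinite(x) for x in v):
            raise ValueError("features contain NaN or Inf")
        return v
```

Even with this in place, keep the warn-but-serve range check in the endpoint itself: a validator can only accept or reject, and some semantic conditions call for logging rather than a hard 422.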
Common Production Patterns
Your basic API works. Now let's talk about real-world complications. In the real world, you're not serving a single, static model. You're managing versions, handling traffic spikes, dealing with edge cases, and sweating the details that separate "works in dev" from "survives in production."
Model Versioning
Your data science team trains a new model. It's better. You want to deploy it without downtime. Solution: serve both versions and route based on headers or query params. This is called a canary deployment (a close relative is shadow testing, where the new model receives copies of live traffic but its predictions are never returned to users). You route a small percentage of live traffic to the new model, compare predictions with the old one, and gradually shift traffic as confidence grows.
class ModelRegistry:
def __init__(self):
self.models = {}
def load_model(self, version: str):
with open(f"./models/model-v{version}.pkl", "rb") as f:
self.models[version] = pickle.load(f)
def get_model(self, version: str):
if version not in self.models:
raise ValueError(f"Model version {version} not found")
return self.models[version]
registry = ModelRegistry()
# (on_event shown for brevity; in a real service, load these in the lifespan handler)
@app.on_event("startup")
async def startup():
registry.load_model("1.0")
registry.load_model("2.0") # New model
@app.post("/predict")
async def predict(request: PredictionRequest, model_version: str = "1.0"):
model = registry.get_model(model_version)
features = np.array(request.features).reshape(1, -1)
prediction = model.predict(features)[0]
    return PredictionResponse(prediction=float(prediction), ...)

Now clients can call /predict?model_version=2.0 to test the new model while 99% of traffic still hits v1.0. Gradual rollout. No downtime. Your ops team deploys the new container, it loads both models at startup, and traffic routing happens at the application layer. If the new model is garbage, you flip the switch back without redeploying.
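To turn the query-param routing above into the percentage-based canary the section describes, all you need is a per-request coin flip. A sketch (the pick_version helper and the 5% default are illustrative assumptions):

```python
import random

CANARY_FRACTION = 0.05  # start by sending 5% of traffic to the new model

def pick_version(canary_fraction: float = CANARY_FRACTION) -> str:
    # Route a random slice of traffic to v2.0; everyone else stays on v1.0
    return "2.0" if random.random() < canary_fraction else "1.0"
```

In the endpoint, default model_version to pick_version() when the client doesn't pin a version explicitly, and log which version served each request so you can compare error rates and confidence distributions between the two cohorts before widening the fraction.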
Feature Engineering at Serving Time
Your model expects normalized features. Do you normalize in the client code (distributed, brittle) or on the server (centralized, testable)?
Always on the server.
class FeatureEngineer:
    def __init__(self):
        self.scaler = None

    def load_scaler(self):
        with open("./models/scaler.pkl", "rb") as f:
            self.scaler = pickle.load(f)

    def engineer(self, raw_features: list[float]) -> np.ndarray:
        raw = np.array(raw_features).reshape(1, -1)
        return self.scaler.transform(raw)

engineer = FeatureEngineer()

@app.on_event("startup")  # or fold this into the lifespan handler
async def startup():
    engineer.load_scaler()

@app.post("/predict")
async def predict(request: PredictionRequest):
    # Engineer on the server side, never in the client
    features = engineer.engineer(request.features)
    prediction = model_state.model.predict(features)[0]
    return PredictionResponse(prediction=float(prediction), ...)

This decouples clients from your feature pipeline. You can change scaling logic, add new features, and handle edge cases, all without touching client code.
Caching Predictions
If the same request comes in multiple times in a minute, why recompute? Cache it.
import hashlib

from cachetools import TTLCache

def hash_features(features: list[float]) -> str:
    """Convert a feature vector to a deterministic cache key."""
    return hashlib.md5(str(features).encode()).hexdigest()

# Simple in-memory cache with a 60-second TTL
prediction_cache = TTLCache(maxsize=10000, ttl=60)

@app.post("/predict")
async def predict(request: PredictionRequest):
    start_time = time.time()
    cache_key = hash_features(request.features)
    if cache_key in prediction_cache:
        logger.info("cache_hit", key=cache_key)
        return prediction_cache[cache_key]
    features = np.array(request.features).reshape(1, -1)
    prediction = float(model_state.model.predict(features)[0])
    confidence = float(model_state.model.predict_proba(features).max())
    latency_ms = (time.time() - start_time) * 1000
    response = PredictionResponse(
        prediction=prediction,
        confidence=confidence,
        latency_ms=latency_ms
    )
    prediction_cache[cache_key] = response
    logger.info("cache_miss", key=cache_key)
    return response

Watch your metrics: the cache hit rate tells you whether clients are asking the same questions over and over.
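To actually watch that hit rate, a pair of counters is enough. A minimal sketch (the CacheStats class is illustrative, not part of the server above):

```python
class CacheStats:
    """Track cache hits and misses and report the running hit rate."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()

# In the endpoint: stats.record(cache_key in prediction_cache), then log
# stats.hit_rate every N requests. A rate near 0 means the cache is just
# wasting memory; a rate near 1 means clients repeat themselves a lot.
```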
Rate Limiting and Request Queuing
Your server can handle 1000 req/sec. What happens at 10,000?
Option 1: Return 429 (Too Many Requests). Tell clients to back off.
Option 2: Queue them and process in batches (if you support it).
Option 3: Use a reverse proxy (nginx, HAProxy) to queue and rate-limit transparently.
For now, FastAPI + slowapi library:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("100/minute")
async def predict(request: Request, payload: PredictionRequest):
    # slowapi requires the raw Request as a parameter named "request",
    # so the Pydantic body moves to a second parameter
    pass  # implement prediction logic

This limits each IP to 100 requests/minute. Beyond that: 429.
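On the client side, a 429 should trigger backoff rather than an immediate retry. A minimal sketch, where call_api is a hypothetical zero-argument stand-in for your HTTP client:

```python
import random
import time

def predict_with_backoff(call_api, max_retries: int = 5, base_delay: float = 0.5):
    """Retry on HTTP 429 with exponential backoff and jitter.

    call_api is any zero-argument callable returning (status_code, body);
    it stands in for whatever HTTP client you actually use.
    """
    delay = base_delay
    status, body = call_api()
    for _ in range(max_retries):
        if status != 429:
            return status, body
        # Sleep with jitter, then double the delay (capped at 30s)
        time.sleep(delay + random.uniform(0, delay / 2))
        delay = min(delay * 2, 30.0)
        status, body = call_api()
    return status, body
```

The jitter matters: without it, all throttled clients retry in lockstep and hit the server in synchronized waves.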
Graceful Shutdown
Your server is running 50 active predictions. You deploy a new version. Kubernetes sends SIGTERM. What do you do?
Don't kill active requests. Let them finish. Then shut down.
import signal
import asyncio

active_predictions = 0
shutdown_event = asyncio.Event()

@app.post("/predict")
async def predict(request: PredictionRequest):
    global active_predictions
    active_predictions += 1
    try:
        # Refuse new work once shutdown has been triggered
        if shutdown_event.is_set():
            raise HTTPException(status_code=503, detail="Server shutting down")
        features = np.array(request.features).reshape(1, -1)
        prediction = model_state.model.predict(features)[0]
        return PredictionResponse(...)
    finally:
        active_predictions -= 1

def signal_handler(signum, frame):
    logger.info("shutdown_signal", signal=signum)
    shutdown_event.set()

# Note: uvicorn installs its own SIGTERM/SIGINT handlers; if you run under
# uvicorn, hook shutdown logic into the lifespan context instead of (or in
# addition to) registering handlers like this.
signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

In your K8s manifest, give the container a 30-second grace period:

terminationGracePeriodSeconds: 30

Kubernetes will wait 30 seconds for active requests to finish before force-killing the container.
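Tying the two together: the shutdown side of a lifespan handler can poll the active-request counter until it drains or a deadline passes. A minimal sketch (wait_for_drain and its parameters are illustrative, not a FastAPI API):

```python
import asyncio
import time

async def wait_for_drain(get_active_count, timeout_s: float = 25.0,
                         poll_s: float = 0.1) -> bool:
    """Wait until get_active_count() reaches zero or the timeout expires.

    get_active_count is any zero-argument callable returning the number of
    in-flight requests. Returns True if the server drained cleanly.
    Keep timeout_s below terminationGracePeriodSeconds so you exit on
    your own terms before Kubernetes force-kills the container.
    """
    deadline = time.monotonic() + timeout_s
    while get_active_count() > 0:
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(poll_s)
    return True
```

Call it from the lifespan handler after `yield`, e.g. `drained = await wait_for_drain(lambda: active_predictions)`, and log a warning when it returns False.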
Performance Tuning Tips
You've built it. Now make it fast.
Use a production ASGI server. Don't run uvicorn directly. Use Gunicorn with Uvicorn workers:
pip install gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app

This gives you 4 worker processes, each with its own async event loop. Much better throughput.
Profile your model. Use Python's cProfile to see where cycles are burning:
import cProfile

def profile_prediction():
    features = np.random.random((1, 4))
    model_state.model.predict(features)

cProfile.run('profile_prediction()', sort='cumtime')

You might find 90% of the time is spent in a single operation. Fix that, and everything improves.
Use GPU if available. If your model is deep learning (PyTorch, TensorFlow), move it to GPU:
model_state.model.to('cuda')  # PyTorch

TensorFlow places ops on a visible GPU automatically; to pin placement explicitly:

with tf.device('/GPU:0'):
    predictions = model_state.model(inputs)  # TensorFlow

GPU inference is orders of magnitude faster for neural nets.
Cache model outputs if sensible. If the same customer asks the same question daily, their prediction shouldn't change unless the model retrains. Cache it in Redis:
import json

import redis

cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

@app.post("/predict")
async def predict(request: PredictionRequest):
    key = f"pred:{hash_features(request.features)}"
    cached = cache.get(key)
    if cached:
        logger.info("redis_cache_hit")
        return json.loads(cached)
    # Compute the prediction
    features = np.array(request.features).reshape(1, -1)
    prediction = float(model_state.model.predict(features)[0])
    response = PredictionResponse(prediction=prediction, confidence=0.0, latency_ms=0.0)
    # Store for 1 hour (use response.model_dump_json() on Pydantic v2)
    cache.setex(key, 3600, response.json())
    return response

Monitoring in Production
Your API is live. Now what?
Log latency percentiles. Track p50, p95, p99:
from collections import deque

import numpy as np

latencies = deque(maxlen=10_000)  # bounded, so memory doesn't grow forever
request_count = 0

@app.post("/predict")
async def predict(...):
    global request_count
    start = time.perf_counter()
    # ... prediction ...
    latencies.append(time.perf_counter() - start)
    request_count += 1
    if request_count % 1000 == 0:
        logger.info(
            "latency_stats",
            p50=np.percentile(latencies, 50),
            p95=np.percentile(latencies, 95),
            p99=np.percentile(latencies, 99)
        )

Track model confidence. Low-confidence predictions might be garbage:
from collections import defaultdict

confidence_histogram = defaultdict(int)
prediction_count = 0

# Inside the endpoint:
confidence = model.predict_proba(features).max()
bucket = int(confidence * 10) / 10  # Group into 0.1-wide buckets
confidence_histogram[bucket] += 1
prediction_count += 1
# Note: count predictions, not histogram buckets (there are only ~11 of those)
if prediction_count % 1000 == 0:
    logger.info("confidence_distribution", histogram=dict(confidence_histogram))

Monitor error rates. Which endpoints fail most? Why?
from collections import defaultdict

errors = defaultdict(int)

try:
    ...  # prediction code
except Exception as e:
    errors[type(e).__name__] += 1
    raise

The Complete Picture
Model serving looks simple: load model, take input, return output. But production is 90% everything else:
- Loading models once and keeping them in memory (lifespan)
- Validating inputs ruthlessly (Pydantic)
- Logging structured data (structlog + latencies)
- Handling multiple versions (routing)
- Feature engineering on server (decoupling)
- Caching predictions (throughput)
- Rate limiting (protection)
- Graceful shutdown (reliability)
- Performance profiling (optimization)
- Health checks (observability)
- Monitoring (sanity)
FastAPI gives you the foundation. The rest is diligence.
Build it right the first time. Your ops team will actually be happy, and your model will actually serve the business instead of rotting as a Jupyter notebook.
Putting It Together: A Complete Example
Here's the full server with all the pieces:
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
import pickle
import numpy as np
from pydantic import BaseModel, Field
import time
import structlog
import asyncio
from concurrent.futures import ThreadPoolExecutor
# Structured logging setup
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)
logger = structlog.get_logger()
# Request/Response models
class PredictionRequest(BaseModel):
    # min_items/max_items is Pydantic v1 syntax; use min_length/max_length on v2
    features: list[float] = Field(..., min_items=4, max_items=4)

class BatchPredictionRequest(BaseModel):
    batch: list[list[float]] = Field(..., min_items=1, max_items=1000)

class PredictionResponse(BaseModel):
    prediction: float
    confidence: float
    latency_ms: float

class BatchPredictionResponse(BaseModel):
    predictions: list[float]
    confidences: list[float]
    latency_ms: float

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool

# Model state
class ModelState:
    def __init__(self):
        self.model = None
        self.model_loaded = False

model_state = ModelState()
executor = ThreadPoolExecutor(max_workers=4)
# Lifespan
@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("startup", event="loading_model")
    try:
        with open("./models/model.pkl", "rb") as f:
            model_state.model = pickle.load(f)
        model_state.model_loaded = True
        logger.info("startup", event="model_loaded")
    except Exception as e:
        logger.error("startup", event="model_load_failed", error=str(e))
    yield
    logger.info("shutdown", event="cleaning_up")
    model_state.model = None
app = FastAPI(title="Model Serving API", lifespan=lifespan)
@app.get("/health")
async def health():
    return HealthResponse(
        status="ok" if model_state.model_loaded else "degraded",
        model_loaded=model_state.model_loaded
    )

def _infer(features: np.ndarray) -> tuple[float, float]:
    """Blocking scikit-learn inference, run off the event loop."""
    pred = float(model_state.model.predict(features)[0])
    conf = float(model_state.model.predict_proba(features).max())
    return pred, conf

@app.post("/predict")
async def predict(request: PredictionRequest):
    start = time.perf_counter()
    if not model_state.model_loaded:
        raise HTTPException(status_code=503, detail="Model not ready")
    try:
        features = np.array(request.features).reshape(1, -1)
        # Run the blocking predict calls in the thread pool so the event loop stays responsive
        loop = asyncio.get_running_loop()
        pred, conf = await loop.run_in_executor(executor, _infer, features)
        latency = (time.perf_counter() - start) * 1000
        logger.info("prediction", prediction=pred, confidence=conf, latency_ms=latency)
        return PredictionResponse(prediction=pred, confidence=conf, latency_ms=latency)
    except Exception as e:
        logger.error("prediction_error", error=str(e))
        raise HTTPException(status_code=500, detail="Prediction failed")
@app.post("/predict-batch")
async def batch_predict(request: BatchPredictionRequest):
    start = time.perf_counter()
    if not model_state.model_loaded:
        raise HTTPException(status_code=503, detail="Model not ready")
    try:
        features = np.array(request.batch)
        preds = model_state.model.predict(features).tolist()
        confs = model_state.model.predict_proba(features).max(axis=1).tolist()
        latency = (time.perf_counter() - start) * 1000
        logger.info(
            "batch_prediction",
            batch_size=len(request.batch),
            latency_ms=latency
        )
        return BatchPredictionResponse(
            predictions=preds,
            confidences=confs,
            latency_ms=latency
        )
    except Exception as e:
        logger.error("batch_prediction_error", error=str(e))
        raise HTTPException(status_code=500, detail="Batch prediction failed")
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run it:
pip install fastapi uvicorn pydantic structlog scikit-learn numpy
python main.py

Test it:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
  -d '{"features": [0.5, 0.3, 0.8, 0.2]}'

You'll get:
{
  "prediction": 0.87,
  "confidence": 0.92,
  "latency_ms": 2.34
}

Key Takeaways
Model serving is NOT just prediction. It's:
- Loading models efficiently (lifespan context manager)
- Validating inputs ruthlessly (Pydantic)
- Structured logging that tells you why things happened (structlog)
- Health checks that ops teams understand (K8s probes)
- Batch endpoints for throughput (vectorized predictions)
- Load testing to find the ceiling (Locust)
Latency matters. Shave 10ms off each prediction, say from 16ms down to 6ms, and a single synchronous worker goes from roughly 62 to 166 requests/second. That's a whole business metric.
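The arithmetic is easy to check; assuming an illustrative baseline of 16ms per prediction dropping to 6ms:

```python
def max_throughput(latency_ms: float) -> float:
    """Upper bound on requests/second for one synchronous worker."""
    return 1000.0 / latency_ms

before = max_throughput(16.0)  # 62.5 req/s
after = max_throughput(6.0)    # ~166.7 req/s
gain = after - before          # ~104 extra req/s per worker

print(f"{before:.1f} -> {after:.1f} req/s (+{gain:.1f})")
```

Multiply that gain by your worker count and the same 10ms shows up directly in capacity planning.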
Logging is instrumentation. If you can't measure it, you can't optimize it.
Batch when you can. Single predictions are fine for real-time APIs, but if you're processing a queue of requests, batch them and watch latency drop.
FastAPI gives you all the tools. Use them. Your ops team will thank you, and your model will actually serve traffic instead of rotting in a notebook.
Conclusion
Model serving is one of those disciplines where the technical gap between "it works" and "it works in production" is enormous, and most of that gap is invisible until you cross it under fire. The endpoint that runs fine on your laptop, answering one request every few seconds, will expose a dozen subtle design flaws the moment it faces 200 concurrent users, model loading failures, malformed inputs, and Kubernetes health check polling all happening simultaneously. The patterns we've covered in this article (lifespan-managed model loading, Pydantic validation, structured JSON logging, explicit health and readiness endpoints, batch prediction, load testing, and graceful shutdown) aren't gold-plating. They're the minimum viable set of concerns for a serving layer that you'd actually be comfortable putting your name on.
FastAPI's real contribution here isn't just performance. It's that the framework's design actively nudges you toward these good practices: the lifespan API makes single-load the natural choice, Pydantic models make validation the path of least resistance, and OpenAPI generation means documentation is never an afterthought. The ops concerns we covered (health checks, structured logging, graceful shutdown) require a little more deliberate effort, but they're all straightforward to implement once you understand what problem each one solves. Start with the skeleton in this article, add the production patterns that fit your specific workload, run Locust against it before you deploy, and you'll have a serving layer that's genuinely ready for production traffic.
Ready to containerize this thing? Next up: Docker and making your API bulletproof for production.