December 19, 2025

End-to-End Machine Learning Project

You've learned the pieces. You've mastered pipelines, hyperparameter tuning, evaluation metrics, and feature engineering. Now comes the part that matters: putting it all together into something real.

An end-to-end ML project isn't just writing code, it's making decisions under uncertainty, managing tradeoffs, and building systems that actually work in production. This article walks you through a complete, reproducible project from problem definition to deployed API.

We're building a churn prediction system. You'll define success metrics with stakeholders, explore data scientifically, engineer features, compare multiple model architectures, select a winner, and deploy it as a tested FastAPI endpoint. By the end, you'll have a template you can adapt to any prediction problem.

The gap between "I know scikit-learn" and "I can ship ML systems" is larger than most people realize. That gap isn't about algorithms, it's about discipline. It's about asking the right questions upfront, measuring the right things, and thinking like an engineer, not just a data scientist. This article closes that gap.

Table of Contents
  1. The Project: Bank Customer Churn Prediction
  2. Phase 1: Problem Definition and Success Metrics
  3. Define Success with Business Partners
  4. Problem Statement
  5. Success Criteria
  6. Define Data Success Criteria
  7. Phase 2: Data Acquisition and Exploratory Data Analysis
  8. Load and Profile Your Data
  9. Visualize Feature Relationships
  10. Check Correlations
  11. Phase 3: Feature Engineering and Data Preparation
  12. Build a Robust Preprocessing Pipeline
  13. Engineer Domain-Specific Features
  14. Phase 4: Building and Comparing Models
  15. Stratified Train-Test Split
  16. Train Multiple Model Pipelines
  17. Phase 5: Final Model Selection and Training
  18. Serialize for Deployment
  19. Phase 6: Building a FastAPI Inference Endpoint
  20. Phase 7: Testing Preprocessing and Predictions
  21. Test Predictions
  22. Phase 8: Project Structure and Documentation
  23. Create a README for Reproduction
  24. Setup
  25. Model Performance
  26. Deploy
  27. Common Mistakes to Avoid
  28. Putting It All Together: The Full Workflow
  29. Why This Approach Matters
  30. Real-World Complications You'll Encounter
  31. Measuring Success in Production
  32. What's Next

The Project: Bank Customer Churn Prediction

Here's the scenario: A bank wants to identify customers likely to close their accounts in the next month. They have historical data on customer behavior, account features, and churn labels. Your job: build a system that predicts churn with actionable precision, then serve it via API.

Why this problem? It's realistic. It has a real business objective. It requires all the techniques you've learned, and it mimics how ML works in the wild. More importantly, it forces you to think about business impact, not just model accuracy. A churn prediction system that's 99% accurate but catches zero real churners is worse than useless, it's a waste of engineering time and money.

In the real world, your models are judged not on metrics but on their impact: Did they prevent churn? Did they save money? Did they reduce false positives that waste the sales team's time? We'll build toward that mindset from the very beginning.

Churn prediction is one of the highest-ROI use cases in machine learning. If your bank retains even a handful of customers who would have left otherwise, the model pays for itself. Unlike some ML applications that solve interesting academic problems with unclear business value, churn prediction directly impacts the bottom line. This is why it's tackled seriously by banks, SaaS companies, and telecom providers worldwide.

The dataset you're working with represents real customer behavior across dimensions like credit score, age, tenure, account balance, and product usage. These aren't random features, they're carefully collected because past business intelligence revealed they correlate with churn. Your job is to learn those patterns and predict new cases.

Phase 1: Problem Definition and Success Metrics

Before you touch a single line of data, talk to stakeholders. What does success actually mean? This step is where most data scientists stumble. They dive into EDA or modeling without understanding what the business actually needs. That leads to beautiful models that no one uses.

Define Success with Business Partners

markdown
# stakeholder_requirements.md
 
## Problem Statement
 
- Predict customer churn within 30 days
- Current churn rate: 27%
- Intervention cost: ~$500/customer
- Retention value (if successful): ~$2000/customer
 
## Success Criteria
 
- Minimize false negatives (miss a churner = lose customer + revenue)
- But false positives are okay (costly outreach, but better than losing them)
- Target: 80%+ recall on churn class
- Must be actionable: explain why each prediction was made

Your metrics aren't just accuracy. They're tied to business outcomes. Here's why this matters: if your goal is to catch all potential churners, you care deeply about recall (how many actual churners you correctly identify). You're willing to accept lower precision (meaning you'll reach out to some customers who wouldn't have churned anyway) because the cost of missing even one churner is high. But in other problems, spam detection, for example, the equation flips. You'd rather miss some spam than falsely label legitimate emails, so precision becomes the priority.

Key decision: Are you optimizing for recall (catch all churners) or precision (only reach out to likely churners)?

For churn, the cost of a false negative (missing a churner) far exceeds a false positive (unnecessary outreach). So you'll aim for high recall, even if precision drops. You might reach out to some people who wouldn't have left anyway, but that's fine. The ROI on retention outweighs the cost.
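You can make this tradeoff quantitative with the stakeholder numbers above ($500 intervention cost, ~$2,000 retention value). A quick sketch of the break-even churn probability; the save_rate (the fraction of contacted churners you actually retain) is a hypothetical input, not something from the requirements doc:

```python
# Break-even churn probability for outreach, using the stakeholder numbers
# above ($500 outreach cost, ~$2000 retention value). save_rate is hypothetical.
def breakeven_probability(intervention_cost, retention_value, save_rate):
    """Intervene when p(churn) * retention_value * save_rate > intervention_cost."""
    return intervention_cost / (retention_value * save_rate)

# If every outreach succeeded, any customer above 25% churn risk is worth a call
print(breakeven_probability(500, 2000, 1.0))   # 0.25
# With a more realistic 50% save rate, the threshold doubles
print(breakeven_probability(500, 2000, 0.5))   # 0.5
```

Numbers like these are also a useful sanity check on whatever classification threshold you eventually pick.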

Define Data Success Criteria

python
# What counts as "good" data for this project?
SUCCESS_CRITERIA = {
    "data_completeness": 0.95,  # 95%+ non-null in key features
    "temporal_coverage": "24+ months of history",
    "label_balance": "OK if imbalanced, but track baseline",
    "feature_relevance": "correlation with churn > 0.05 or business rationale"
}

Document these upfront. They guide your EDA and feature selection. Too many projects skip this step and end up with models trained on broken data. You want to know what "good enough" looks like before you start exploring. That gives you an objective way to validate your dataset and move forward with confidence.

Phase 2: Data Acquisition and Exploratory Data Analysis

You'll work in a Jupyter notebook: structured, reproducible, commented. This is where you shift from problem definition to empirical investigation. You'll ask: What's in this data? What's missing? What patterns jump out? This phase is about building intuition, not building models.

EDA (Exploratory Data Analysis) is one of the most underrated skills in machine learning. Many practitioners rush through it, eager to start modeling. But spending time understanding your data pays massive dividends. You'll spot data quality issues before they derail your models. You'll discover business logic embedded in the features. You'll identify outliers that need special handling. Most importantly, you'll develop a gut feeling for what your model should learn.

Think of EDA as a dialogue with your data. You ask questions (What's the distribution of age? How correlated are features?), examine the answers (visualizations, statistics), and ask follow-up questions based on what you find. This iterative process reveals the story encoded in your data.

Load and Profile Your Data

python
# notebooks/01_eda.ipynb
 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
 
# Load data
df = pd.read_csv('../data/raw/bank_churn.csv')
 
print(f"Shape: {df.shape}")
print(f"Duplicates: {df.duplicated().sum()}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"\nChurn distribution:\n{df['Churn'].value_counts(normalize=True)}")
 
# Basic statistics
df.describe()

This is your first moment of truth. You're checking: Is the data what we expected? Are there missing values we need to handle? Are the classes balanced? What's the data quality baseline?

Output:

Shape: (10000, 14)
Duplicates: 0
Missing values: 0
Churn distribution:
0    0.73
1    0.27

Good news: no missing values and no duplicates. You have 10,000 customer records with 14 features. The churn rate is 27%, which means about 2,700 customers churned and 7,300 didn't. That's moderately imbalanced (something closer to 50/50 would be easier to learn from), but it's workable. You'll handle it later with class weighting in your models.

The absence of duplicates and missing values is a good sign. In many real datasets, you'll find duplicate records (same customer appearing twice) or missing values scattered throughout. These require investigation. Duplicates might indicate data pipeline bugs or legitimate customers with multiple transactions. Missing values might be random or systematic (missing because someone didn't fill out a field). Here, you got lucky, a clean dataset to start with. Enjoy it while it lasts, because this won't be your experience with most datasets.
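When you do hit a messier dataset, a short triage pass against the success criteria from Phase 1 is worth the few lines. A hypothetical sketch (the toy frame below is illustrative, not this project's data):

```python
import numpy as np
import pandas as pd

# Toy frame with one exact duplicate row and one missing balance
df = pd.DataFrame({
    'CustomerId': [1, 2, 2, 3],
    'Balance': [1000.0, 2500.0, 2500.0, np.nan],
})

# Exact duplicate rows usually signal a pipeline bug; drop them after inspecting
dupes = df.duplicated().sum()
df = df.drop_duplicates()

# Check each column against the 95% completeness criterion defined in Phase 1
completeness = df.notna().mean()
failing = completeness[completeness < 0.95]

print(f"Dropped {dupes} duplicate rows")
print(failing)   # columns that miss the completeness bar
```

In a real project you'd log these findings in the notebook rather than silently fixing them, so the cleaning decisions stay auditable.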

Visualize Feature Relationships

python
# Understand what drives churn
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
 
# Age vs Churn
df.boxplot(column='Age', by='Churn', ax=axes[0, 0])
axes[0, 0].set_title('Age by Churn Status')
 
# Tenure vs Churn
df.boxplot(column='Tenure', by='Churn', ax=axes[0, 1])
axes[0, 1].set_title('Tenure by Churn Status')
 
# Credit Score vs Churn
df.boxplot(column='CreditScore', by='Churn', ax=axes[1, 0])
axes[1, 0].set_title('Credit Score by Churn Status')
 
# Balance vs Churn
df.boxplot(column='Balance', by='Churn', ax=axes[1, 1])
axes[1, 1].set_title('Account Balance by Churn Status')
 
plt.tight_layout()
plt.savefig('../plots/churn_distributions.png', dpi=300)

These visualizations let you see patterns without requiring statistical tests. Boxplots are perfect for this: they show the distribution of each feature split by the churn label. You're looking for clear separation. If the boxes (representing the middle 50% of data) don't overlap much between churners and non-churners, that feature is predictive.

Insight: Older customers, those with shorter tenure, and those with lower balances churn more often. This makes intuitive sense: they're less "sticky." An older customer who joined recently and keeps little money in the account has little attachment to lose. A veteran customer with a significant balance has much more to lose by leaving. These aren't surprising patterns, but they're validating: your data isn't random noise.

Check Correlations

python
# Numeric features only
numeric_cols = df.select_dtypes(include=[np.number]).columns
correlation = df[numeric_cols].corr()['Churn'].sort_values(ascending=False)
 
print(correlation)

Correlation analysis quantifies what the visualizations suggested. It tells you the linear relationship between each feature and your target (churn).

Output:

Churn            1.000000
Age              0.287550
CreditScore     -0.025000
Balance         -0.058000
NumOfProducts   -0.304129
Tenure          -0.369000

Tenure and number of products are your strongest predictors. Age matters. Credit score, surprisingly, doesn't. This is interesting: credit score has almost zero correlation with churn. In many banking problems, you'd expect financially healthier customers to stick around, but the data disagrees. This could mean the bank's churn is driven by service quality, not financial health. That's actionable intelligence you can share with stakeholders.

The insight that credit score doesn't predict churn should immediately trigger a follow-up conversation with the business. Maybe the data is wrong. Maybe credit score does matter but there's a confounding variable. Maybe the bank's customer base is predominantly middle-class with similar credit scores, masking the relationship. Anomalies in your data are opportunities to learn something, not problems to ignore.

Notice also that numeric correlations don't tell the whole story. Correlation measures linear relationships. Some predictors might have non-linear relationships with churn (age groups, for example, where young and old customers behave similarly but middle-aged customers differ). That's where feature engineering comes in. A linear model trained on raw features might miss these patterns entirely.
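A quick way to spot such non-linear patterns is to bin the feature and compare churn rates per bin; a U-shape there is invisible to plain correlation. A sketch on synthetic data (the labels below are fabricated specifically to show the effect):

```python
import numpy as np
import pandas as pd

# Synthetic example: young and elderly customers churn, middle-aged don't
rng = np.random.default_rng(42)
ages = rng.integers(18, 80, size=1000)
churn = ((ages < 30) | (ages > 55)).astype(int) & rng.integers(0, 2, size=1000)

demo = pd.DataFrame({'Age': ages, 'Churn': churn})
demo['AgeGroup'] = pd.cut(demo['Age'], bins=[0, 30, 40, 50, 100],
                          labels=['Young', 'Middle', 'Senior', 'Elderly'])

# Churn rate per age bin reveals the non-monotonic pattern
rates = demo.groupby('AgeGroup', observed=True)['Churn'].mean()
print(rates)
```

The overall Pearson correlation between Age and Churn here would be modest, yet the per-bin rates make the U-shape obvious.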

Phase 3: Feature Engineering and Data Preparation

Raw data rarely works. You need to engineer features that encode domain knowledge. Feature engineering is where subject matter experts outperform generic algorithms. You know that customer tenure, age groups, and product diversity matter. Your job is to make that knowledge explicit in your features.

Build a Robust Preprocessing Pipeline

python
# src/preprocessing.py
 
import pandas as pd
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
 
def create_preprocessor():
    """Create reusable preprocessing pipeline."""
 
    numeric_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']
    categorical_features = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
 
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])
 
    categorical_transformer = Pipeline(steps=[
        # pd.get_dummies isn't a fittable transformer; OneHotEncoder is, and
        # handle_unknown='ignore' keeps unseen categories from crashing inference
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
 
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
 
    return preprocessor, numeric_features, categorical_features
 
preprocessor, numeric_cols, categorical_cols = create_preprocessor()
X_processed = preprocessor.fit_transform(X)  # X: feature DataFrame (target column dropped)

This code encapsulates your preprocessing logic in a reusable pipeline. Why reusable? Because when you train your final model, you need the exact same transformations. When you deploy to production, you apply the same preprocessing to new data. A common mistake is training on scaled features but then forgetting to scale in production, leading to terrible predictions. Pipelines prevent this by keeping preprocessing and modeling together.

The code normalizes numeric features (using StandardScaler, which subtracts the mean and divides by the standard deviation). This puts all numeric features on the same scale, which matters for scale-sensitive algorithms like logistic regression and for distance-based methods like k-nearest neighbors. For tree-based models like Random Forest and Gradient Boosting, scaling is less critical because they split features at thresholds rather than computing distances or weighted sums. But including it doesn't hurt and makes the pipeline flexible.

For categorical features, it uses one-hot encoding (creating binary columns for each category). When you have a "Geography" field with values France, Germany, Spain, one-hot encoding creates three binary columns: is_France, is_Germany, is_Spain. Each row has exactly one of these set to 1 and the others to 0. This representation is what most ML algorithms expect.

A critical detail: when you fit the preprocessor on training data, it learns statistics (the mean and standard deviation for scaling, the unique categories for one-hot encoding). When you transform test data, you apply these learned statistics without recomputing them. If you recomputed statistics on test data, you'd leak test information into your model. That's a data leak. Pipelines keep this straight by fitting once and transforming multiple times.
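The fit-once, transform-many discipline is easy to see in a minimal sketch: the scaler learns its statistics from the training split and reuses them, unchanged, on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean and std from train only
X_test_scaled = scaler.transform(X_test)  # reuses train statistics; no refit

print(scaler.mean_)      # [20.] -- learned from the training split
print(X_test_scaled)     # the test point scaled with TRAIN mean and std
```

Calling fit_transform on the test set instead would recompute the statistics from test data, which is exactly the leak described above.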

Engineer Domain-Specific Features

Raw features work, but domain knowledge gets you further.

python
def engineer_features(df):
    """Create new features from domain knowledge."""
 
    # Tenure-based segmentation
    df['IsNewCustomer'] = (df['Tenure'] < 6).astype(int)  # New = <6 months
    df['IsVeteran'] = (df['Tenure'] > 36).astype(int)     # Veteran = >3 years
 
    # Product diversity (more products = stickier)
    df['ProductDiversity'] = (df['NumOfProducts'] > 1).astype(int)
 
    # Financial engagement (balance relative to salary)
    df['BalanceToSalaryRatio'] = df['Balance'] / (df['EstimatedSalary'] + 1)
 
    # Age group (captures non-linear age effects)
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 30, 40, 50, 100],
                             labels=['Young', 'Middle', 'Senior', 'Elderly'])
 
    # Activity score (combine activity and products)
    df['EngagementScore'] = df['IsActiveMember'] * (df['NumOfProducts'] + 1)
 
    return df

Here's the magic of feature engineering. You're not just using raw features; you're creating new ones that capture business logic. Why? Because raw tenure (a number from 0 to 40) doesn't capture the meaningful distinction between "brand new customer" (< 6 months, high churn risk) and "veteran" (> 3 years, sticky). By creating binary flags for these segments, you're telling the model: "This distinction matters."

These features encode assumptions: younger customers behave differently, veterans are sticky, product count matters. You're adding interpretability and potentially improving performance. Many practitioners skip this step and just throw raw features at a black-box model. But the best practitioners think about what their features mean.

The "BalanceToSalaryRatio" feature is particularly interesting. It's not raw balance or raw salary, it's the ratio, which captures a different concept: financial engagement relative to income. A customer with $50,000 balance and $100,000 salary might be more engaged than one with $50,000 balance and $500,000 salary. By creating ratio features, you're forcing the model to learn relationships that require multiplying or dividing features, which linear models can't do naturally.

The "EngagementScore" combines two features: whether the customer is active and how many products they have. An active customer with no products behaves differently from an active customer with multiple products. By multiplying these together, you create an interaction term that might be more predictive than either alone.

This is the art of feature engineering: knowing your domain, understanding what matters to the business, and creating representations that make those patterns explicit. It's not always obvious which features help. Sometimes you'll engineer ten features and only two improve your model. That's fine. The cost of engineering features is low. The cost of missing important patterns is high.

Phase 4: Building and Comparing Models

Now you train multiple models and compare fairly. This is where rigor matters. You won't just train one model and call it done. You'll build three, compare them side-by-side under the same conditions, and pick the winner based on your success metrics.

Why multiple models? Because no algorithm is universally best. Logistic Regression is interpretable but limited in what patterns it can learn. Random Forest handles non-linear relationships and interactions but is harder to debug. Gradient Boosting is powerful but prone to overfitting if not careful. By comparing all three, you make an informed choice based on evidence, not intuition.

Model selection is fundamentally about tradeoffs. Performance versus interpretability. Training speed versus inference speed. Computational cost versus accuracy. Model robustness versus complexity. Your job is to understand these tradeoffs and choose the point on the curve that best serves your business needs. Sometimes the most accurate model isn't the best choice if it can't be deployed, doesn't run fast enough, or can't be explained to stakeholders.

Stratified Train-Test Split

Always stratify by the target when imbalanced.

python
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Maintain churn ratio in train and test
)
 
print(f"Train: {y_train.value_counts(normalize=True)}")
print(f"Test:  {y_test.value_counts(normalize=True)}")

This is a critical step that prevents a subtle but serious bug. If you split data randomly, there's a chance your test set ends up with a different churn rate than your training set. Imagine if your test set happened to have 50% churners when the real rate is 27%. Your model would look artificially good on test data but fail in production. Stratified splitting ensures both train and test sets have the same churn distribution as the original data.

The random_state=42 makes your split reproducible. Anyone running this code gets the same train/test split. This matters for collaboration and debugging. You can't debug if results change every run.
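To see why stratification preserves the ratio, here's a minimal pure-Python sketch of the idea (scikit-learn's implementation is more sophisticated, but the principle is the same): split each class separately, then recombine.

```python
import random
from collections import Counter

def stratified_split(labels, test_frac, seed=42):
    """Toy stratified split: sample test_frac from each class independently."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx

labels = [1] * 270 + [0] * 730   # 27% churn, like the dataset
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
print(Counter(labels[i] for i in test_idx))   # 54 churners, 146 non-churners: 27% preserved
```

A purely random split would only approximate the 27% rate, with variance that grows as the minority class shrinks.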

Train Multiple Model Pipelines

You'll compare Logistic Regression, Random Forest, and Gradient Boosting.

python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_validate
import warnings
warnings.filterwarnings('ignore')
 
# Define models
models = {
    'LogisticRegression': LogisticRegression(
        max_iter=1000,
        random_state=42,
        class_weight='balanced'  # Handle imbalance
    ),
    'RandomForest': RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42,
        class_weight='balanced',
        n_jobs=-1
    ),
    'GradientBoosting': GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
}
 
# Build pipelines
from sklearn.pipeline import Pipeline
 
pipelines = {}
for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    pipelines[name] = pipeline
 
# Evaluate with nested cross-validation
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score, precision_score, f1_score, roc_auc_score
 
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
 
results = {}
for name, pipeline in pipelines.items():
    cv_scores = cross_validate(
        pipeline,
        X_train,
        y_train,
        cv=cv,
        scoring={
            'recall': 'recall',
            'precision': 'precision',
            'f1': 'f1',
            'roc_auc': 'roc_auc'
        },
        n_jobs=-1
    )
 
    results[name] = {
        'recall': cv_scores['test_recall'].mean(),
        'precision': cv_scores['test_precision'].mean(),
        'f1': cv_scores['test_f1'].mean(),
        'roc_auc': cv_scores['test_roc_auc'].mean()
    }
 
# Compare results
results_df = pd.DataFrame(results).T
print(results_df.round(3))

This code does something important: it uses stratified 5-fold cross-validation to evaluate each model. Here's why that matters. If you trained on the full training set and evaluated on the same data, you'd see optimistically inflated scores (the model overfits). If you used a simple train-test split without CV, you'd get high variance in your estimates, maybe one random split favors Logistic Regression, another favors Random Forest.

Stratified k-fold CV gives you a robust estimate. You split the training data into 5 folds, maintaining the churn ratio in each. Then you train 5 times, each time using 4 folds for training and 1 for validation, rotating which fold is validation. This gives you 5 estimates of performance, and you average them. It's more expensive computationally but much more reliable.

You also evaluate on multiple metrics, not just accuracy. Accuracy is useless for imbalanced data. Recall tells you what fraction of actual churners you catch. Precision tells you what fraction of your predictions are right. F1 balances both. ROC-AUC measures your ability to rank customers by churn risk. Together, these metrics paint a complete picture.

Output:

                   recall  precision      f1  roc_auc
LogisticRegression   0.687      0.612   0.648    0.816
RandomForest         0.752      0.687   0.717    0.875
GradientBoosting     0.779      0.703   0.738    0.887

Decision: GradientBoosting wins on recall (78%), which is your priority. Logistic Regression is interpretable but lags in performance. Random Forest is a middle ground: good performance, better interpretability than boosting, still respectable metrics. In this case, the extra performance from GradientBoosting justifies the loss in interpretability. You're catching about 78% of churners with 70% precision in cross-validation, just shy of the 80% recall target; a final round of hyperparameter tuning should close the gap.
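To make these metrics concrete, here's how recall, precision, F1, and accuracy fall out of raw confusion-matrix counts. The counts below are illustrative, not from this project:

```python
# Illustrative confusion-matrix counts for a churn classifier
tp, fp, fn, tn = 430, 180, 110, 1280

recall = tp / (tp + fn)             # share of actual churners caught
precision = tp / (tp + fp)          # share of churn flags that were right
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note how accuracy looks flattering here even though the model misses 110 churners; this is why accuracy alone misleads on imbalanced data.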

Phase 5: Final Model Selection and Training

Pick GradientBoosting. Train it on all training data. This is different from cross-validation. During CV, you held back a fold to evaluate. Now that you've chosen your champion model, you discard that practice and train on all available training data to maximize the signal your model sees.

python
# Train final model on full training set
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(
        n_estimators=150,
        learning_rate=0.08,
        max_depth=6,
        random_state=42
    ))
])
 
final_pipeline.fit(X_train, y_train)
 
# Evaluate on held-out test set
y_pred = final_pipeline.predict(X_test)
y_pred_proba = final_pipeline.predict_proba(X_test)[:, 1]
 
print(f"Test Recall: {recall_score(y_test, y_pred):.3f}")
print(f"Test Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Test F1: {f1_score(y_test, y_pred):.3f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")

Notice the hyperparameters changed slightly (n_estimators=150, learning_rate=0.08, max_depth=6). You might do a grid search or random search to optimize these on your validation fold. But here, you're training the final version and evaluating on truly unseen test data.

The test set is your only honest evaluation. You've seen everything else, training data, validation folds, even your feature engineering choices influenced by EDA. The test set is fresh. If your test metrics match your CV metrics, you're in good shape. If test metrics are much worse, you overfit or made a mistake somewhere.

Output:

Test Recall: 0.798
Test Precision: 0.714
Test F1: 0.754
Test ROC-AUC: 0.891

Excellent. You're catching just under 80% of churners with 71% precision. That effectively meets your success criterion of 80% recall. The model is ready for deployment.
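One loose end: the tuned hyperparameters above (n_estimators=150, learning_rate=0.08, max_depth=6) were stated without showing the search. A hedged sketch of the kind of GridSearchCV run that might produce them; the grid and the synthetic data here are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data with roughly the 73/27 class balance of the churn problem
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.73], random_state=42)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={'n_estimators': [50, 150],
                'learning_rate': [0.08, 0.1],
                'max_depth': [3, 6]},
    scoring='recall',   # optimize for the business-priority metric
    cv=3,
    n_jobs=-1,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

In the real project you'd run this on X_train with the full pipeline (preprocessor included) as the estimator, prefixing parameter names with `model__`.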

Serialize for Deployment

python
import joblib
 
# Save the trained pipeline
joblib.dump(final_pipeline, '../models/churn_model_v1.pkl')
print("Model saved to models/churn_model_v1.pkl")
 
# Also save the preprocessor separately (useful for debugging)
joblib.dump(preprocessor, '../models/preprocessor_v1.pkl')

Joblib serializes your fitted pipeline to disk. This preserves not just the model weights but the entire pipeline state, the scaler's learned mean and standard deviation, the one-hot encoder's categories, everything. When you load this file later, you can call predict() on new data and get the same transformations applied automatically.

Saving the preprocessor separately is a good practice. If a prediction seems wrong, you can debug by examining what features the preprocessor created for that customer.
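Under the hood, joblib follows the same serialize/deserialize idea as Python's pickle: the fitted object's learned state goes to disk and comes back intact. A minimal stdlib sketch with a stand-in "fitted" object:

```python
import pickle

class TinyScaler:
    """Stand-in for a fitted transformer with learned state."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)   # the learned statistic
        return self
    def transform(self, values):
        return [v - self.mean_ for v in values]

scaler = TinyScaler().fit([10, 20, 30])
blob = pickle.dumps(scaler)        # serialize (joblib.dump writes this to a file)
restored = pickle.loads(blob)      # deserialize (joblib.load reads it back)

print(restored.mean_)              # 20.0 (the learned state survived the round trip)
print(restored.transform([40]))    # [20.0], same behavior as before saving
```

joblib is preferred over raw pickle for scikit-learn objects because it handles large NumPy arrays more efficiently, but the contract is the same: load it, and predict() works as if you'd never left the training session.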

Phase 6: Building a FastAPI Inference Endpoint

Your model is trained. Now deploy it. An API wraps your model in a web service that other applications can call. This is where your model transitions from being a data scientist's artifact to a production system used by real applications.

Building an API might seem like overkill for a simple model. Why not just call the model directly from your application code? Several reasons. First, an API decouples your model from the application. If you need to update the model, you deploy a new API version without touching the application code. Second, an API lets different teams use your model without needing Python or data science expertise, they just send HTTP requests. Third, an API enforces a contract (input/output schemas) that prevents bugs. Fourth, it enables monitoring, logging, and auditing of predictions for compliance and debugging.

FastAPI is an excellent choice for ML APIs. It's fast, modern, and automatically generates interactive documentation. When you send a request, FastAPI validates it against your schema and returns a 422 error if it's invalid. This prevents garbage inputs from reaching your model. It also infers types from your Pydantic models and provides sensible defaults. The documentation at /docs is auto-generated and interactive, you can test endpoints directly in the browser.

python
# src/api.py
 
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import numpy as np
 
app = FastAPI(title="Churn Prediction API", version="1.0.0")
 
# Load model at startup
model = joblib.load('models/churn_model_v1.pkl')
 
class CustomerData(BaseModel):
    """Input schema for churn prediction."""
    CreditScore: float
    Age: int
    Tenure: int
    Balance: float
    EstimatedSalary: float
    Geography: str
    Gender: str
    HasCrCard: int
    IsActiveMember: int
    NumOfProducts: int
 
class PredictionResponse(BaseModel):
    """Output schema for churn prediction."""
    customer_id: str
    churn_probability: float
    churn_prediction: int
    confidence: float
    action: str
 
@app.get("/health")
def health_check():
    return {"status": "healthy"}
 
@app.post("/predict")
def predict(data: CustomerData):
    """Predict churn for a single customer."""
    try:
        # Convert to DataFrame (required by sklearn pipeline)
        customer_df = pd.DataFrame([data.dict()])
 
        # Make prediction
        churn_pred = model.predict(customer_df)[0]
        churn_prob = model.predict_proba(customer_df)[0, 1]
 
        # Determine action based on probability
        if churn_prob > 0.6:
            action = "URGENT: Outreach immediately"
        elif churn_prob > 0.4:
            action = "STANDARD: Schedule retention call"
        else:
            action = "MONITOR: Low risk, no action needed"
 
        return PredictionResponse(
            customer_id="CUST_001",  # Placeholder; in production, accept an ID in the request schema
            churn_probability=float(churn_prob),
            churn_prediction=int(churn_pred),
            confidence=max(churn_prob, 1 - churn_prob),
            action=action
        )
 
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
 
@app.post("/batch_predict")
def batch_predict(customers: list[CustomerData]):
    """Predict churn for multiple customers."""
    customers_df = pd.DataFrame([c.dict() for c in customers])
    probs = model.predict_proba(customers_df)[:, 1]
    return {"predictions": probs.tolist()}

This API does several important things. It defines input and output schemas using Pydantic. These enforce type checking and provide automatic documentation. It loads the model once at startup (not on every request, that would be slow). It wraps predictions in a try-except to handle errors gracefully. It provides not just the prediction but also the probability and a recommended action based on the probability threshold.

The Pydantic models (CustomerData and PredictionResponse) are contracts. They define exactly what fields the API accepts and what it returns. If someone sends a request missing a required field or with the wrong type, FastAPI automatically rejects it with a 422 error before your code even runs. This prevents bad data from reaching your model. It also makes the API self-documenting, anyone reading the code immediately understands the input/output format.

The action logic is crucial: you're not just returning a binary prediction (churn or not). You're saying "this customer needs urgent outreach" or "monitor this one." This transforms a model output into something the business can act on immediately. The thresholds (0.6 for urgent, 0.4 for standard) are business decisions, not statistical ones. A retention specialist might look at a 45% churn probability and decide it doesn't warrant an immediate call, so 0.6 marks the cutoff for "urgent." Customers between 0.4 and 0.6 get a more passive approach. Those below 0.4 aren't contacted. These thresholds should be tuned based on the business's outreach capacity and success rates, not just the model.

Run it:

bash
uvicorn src.api:app --reload --port 8000

Visit http://localhost:8000/docs to see interactive Swagger docs. You can test the endpoint right there in your browser. This is a huge advantage of FastAPI: built-in interactive documentation and testing. You'll also see schemas for every request/response, helping clients integrate with your API. If your API changes (you add a new field), the docs update automatically.

Phase 7: Testing Preprocessing and Predictions

Production code needs tests. A model that works in a notebook might break in production due to subtle data issues, missing features, or type mismatches. Testing isn't optional when you're shipping code that affects business decisions. Bad predictions from an untested model could mislead the business into making poor retention decisions.

The philosophy here is simple: test the boring stuff. You don't need sophisticated tests for your GradientBoosting model (scikit-learn's internals are already tested). You need tests for the bridges between stages: preprocessing, serialization, deserialization, API validation. That's where bugs hide. Test that your preprocessor maintains the same learned statistics when reloaded. Test that your API rejects invalid input. Test that your model outputs are in expected ranges. These mundane tests catch the issues that cause production fires.

Unit tests in ML projects follow the same principle as elsewhere: test one thing at a time in isolation. You test preprocessing separately from inference. You test the API separately from the model. This modularity makes debugging easier. When a test fails, you know exactly where the problem is.

python
# tests/test_preprocessing.py
 
import pytest
import pandas as pd
import numpy as np
from src.preprocessing import create_preprocessor
 
@pytest.fixture
def sample_data():
    """Create sample input data."""
    return pd.DataFrame({
        'CreditScore': [600, 750, 800],
        'Age': [25, 45, 65],
        'Tenure': [2, 15, 30],
        'Balance': [5000, 50000, 100000],
        'EstimatedSalary': [50000, 80000, 120000],
        'Geography': ['France', 'Germany', 'Spain'],
        'Gender': ['Male', 'Female', 'Male'],
        'HasCrCard': [1, 1, 0],
        'IsActiveMember': [1, 0, 1],
        'NumOfProducts': [1, 2, 3]
    })
 
def test_preprocessor_output_shape(sample_data):
    """Test preprocessor produces correct shape."""
    preprocessor, _, _ = create_preprocessor()
    X = preprocessor.fit_transform(sample_data)
    assert X.shape[0] == 3  # 3 samples
    assert X.shape[1] > 0   # Has features
 
def test_numeric_scaling(sample_data):
    """Test numeric features are scaled."""
    preprocessor, numeric_cols, _ = create_preprocessor()
    X = preprocessor.fit_transform(sample_data)
    # Scaled values should be small in magnitude; allow some slack
    assert np.all(np.abs(X[:, :len(numeric_cols)]) < 5)
 
def test_no_nan_in_output(sample_data):
    """Test output has no NaN values."""
    preprocessor, _, _ = create_preprocessor()
    X = preprocessor.fit_transform(sample_data)
    assert not np.isnan(X).any()

These tests ensure your preprocessing doesn't silently break. They're simple but critical. The first test checks that the output has the right shape. The second verifies that numeric features are actually scaled (values in a reasonable range). The third ensures no NaN values sneak in. Together, they catch the most common preprocessing bugs.
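
One bridge worth testing explicitly is serialization: a reloaded pipeline should produce exactly the same predictions as the original. A minimal round-trip test, sketched here with a toy model rather than the project's actual artifacts, might look like:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_serialization_round_trip():
    """A reloaded model must reproduce the original's predictions exactly."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    y = (X[:, 0] > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    before = model.predict_proba(X)

    # Dump and reload through a temporary file, as deployment would
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.pkl")
        joblib.dump(model, path)
        reloaded = joblib.load(path)

    after = reloaded.predict_proba(X)
    np.testing.assert_array_equal(before, after)
```

The same pattern applies to the fitted preprocessor: dump it, reload it, and assert that `transform` output is unchanged.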

Run tests:

bash
pytest tests/test_preprocessing.py -v

Test Predictions

python
# tests/test_api.py
 
from fastapi.testclient import TestClient
from src.api import app
 
client = TestClient(app)
 
def test_health():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"
 
def test_predict():
    payload = {
        "CreditScore": 700,
        "Age": 35,
        "Tenure": 10,
        "Balance": 50000,
        "EstimatedSalary": 75000,
        "Geography": "France",
        "Gender": "Male",
        "HasCrCard": 1,
        "IsActiveMember": 1,
        "NumOfProducts": 2
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    data = response.json()
    assert 0 <= data["churn_probability"] <= 1
    assert data["churn_prediction"] in [0, 1]
 
def test_invalid_input():
    bad_payload = {"Age": "not_a_number"}
    response = client.post("/predict", json=bad_payload)
    assert response.status_code == 422  # Validation error

These tests verify your API works end-to-end. The health check test ensures the server is running. The predict test sends valid data and checks that the response has the right structure and values in valid ranges. The invalid input test confirms that invalid data is rejected with a proper HTTP status code (422, the validation-error code).

These tests are simple, but they catch errors that would cause production outages. Running them before deployment takes seconds and prevents hours of debugging.

Phase 8: Project Structure and Documentation

Organize for reproducibility. When you come back to this project in six months, you should be able to understand every decision. The irony of data science is that the hardest part isn't the science; it's the discipline required to ship working systems.

Project structure matters because ML projects are messy. You have raw data, processed data, training artifacts, model checkpoints, notebooks at various stages of completion, and dozens of small utilities. Without structure, this chaos grows. Six months later, you're digging through your hard drive trying to remember which version of the preprocessor you used, which features you engineered, and why you chose those hyperparameters. A well-organized project prevents this. It says: "Here's the raw data, here's the processed data, here's how we got from A to B, here's the final model, here's how to run it."

Documentation is the often-overlooked secret weapon. A perfectly organized codebase with zero documentation is still useless, because no one knows how to use it; a slightly messier codebase with clear documentation is more useful. Document not just how to run your code but why you made decisions. Why did you choose GradientBoosting over Random Forest? Why did you engineer these specific features? Why did you use a stratified split rather than a random one? Future you will be grateful.

churn-prediction/
├── data/
│   ├── raw/
│   │   └── bank_churn.csv
│   └── processed/
│       └── X_train.pkl
├── models/
│   ├── churn_model_v1.pkl
│   └── preprocessor_v1.pkl
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── preprocessing.py
│   ├── api.py
│   └── utils.py
├── tests/
│   ├── test_preprocessing.py
│   └── test_api.py
├── requirements.txt
├── README.md
└── config.yaml

This structure is standard. Raw data is separate from processed. Models are versioned. Notebooks are sequential and numbered. Source code is modular. Tests are in their own directory. This organization makes the project easy to navigate for you and anyone else who touches it.

Create a README for Reproduction

Your README should explain what the project does, how to set it up, what performance to expect, and how to deploy it. Future you (or a colleague) can read this and be productive within minutes. Here's what a good README includes:

# Bank Churn Prediction

Predicts which customers will churn within 30 days using gradient boosting.

## Setup

pip install -r requirements.txt
jupyter notebook notebooks/01_eda.ipynb

## Model Performance

- Recall: 79.8% (catches 80% of churners)
- Precision: 71.4% (71% of predictions are correct)
- ROC-AUC: 0.891

## Deploy

uvicorn src.api:app --host 0.0.0.0 --port 8000

This structure immediately tells future readers what the project is, how to run it, and what to expect for performance. No mysteries, no surprises.

Common Mistakes to Avoid

Before we wrap up, let's talk about the pitfalls that trap experienced programmers when they first enter ML. These mistakes are subtle because they often don't cause errors; they cause silent failures where your model appears to work but produces bad results.

Data leakage is the most dangerous. You accidentally let information from the test set influence your training. Examples include scaling the entire dataset before splitting (test data influences the scaler's statistics), engineering features using global statistics, or directly using the target variable to create features. Always split first, then process. Always maintain the barrier between train and test.
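
The scaling example is worth seeing concretely. A leak-free workflow splits first, fits the scaler on the training split only, and reuses those learned statistics on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=100, scale=20, size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# Split FIRST, so test data never influences learned statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics
```

Calling `fit_transform` on the full dataset before splitting would let the test rows shift the scaler's mean and standard deviation, which is exactly the leak described above.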

Ignoring class imbalance is another common mistake. If your dataset is 95% negative and 5% positive, and you use accuracy as your metric, a model that always predicts negative gets 95% accuracy while catching zero positives. This is why we used stratified splitting and multiple metrics (recall, precision, F1, ROC-AUC) instead of relying solely on accuracy.
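
You can demonstrate the accuracy trap in a few lines with scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 95% negative, 5% positive, mimicking a rare-event problem
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # features don't matter for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)

print(accuracy_score(y, preds))  # 0.95 -- looks great
print(recall_score(y, preds))    # 0.0  -- catches zero positives
```

Any real model you train should beat this baseline on recall, not just on accuracy; if it doesn't, the metric was hiding the problem.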

Hyperparameter tuning on your test set is subtle but deadly. If you test multiple hyperparameter combinations and pick the one that performs best on your test set, you're overfitting to your test set. That's why we used cross-validation on the training set to select hyperparameters, then evaluated once on truly held-out data.
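
That discipline can be sketched with a toy dataset: all hyperparameter selection happens via cross-validation inside the training split, and the test set is touched exactly once at the end:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Tuning uses only the training split, via 5-fold cross-validation
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# The held-out test set is evaluated once, after tuning is finished
test_score = search.score(X_test, y_test)
```

If you instead reran the grid against the test set and kept the best score, every extra combination you tried would inflate the apparent performance.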

Assuming your model will work forever is optimistic but naive. Data distributions change. Customer behavior evolves. Seasonal effects emerge. A model trained on 2023 data might perform terribly on 2025 data. Build monitoring and retraining into your system from day one.

Deploying without considering fairness and bias. Your model might perform well on average but terribly for specific demographic groups. This is especially important for consequential decisions like credit, hiring, or criminal justice. Test your model's performance across different subgroups. If you find disparities, fix them before deployment.

Putting It All Together: The Full Workflow

Here's what you actually do in practice. This is the checklist you return to for every ML project:

  1. Define success with stakeholders (not just accuracy, business metrics)
  2. Explore data systematically (distributions, correlations, missing values, biases)
  3. Engineer features using domain knowledge (tenure groups, engagement scores)
  4. Split data strategically (train/validation/test with stratification)
  5. Compare models fairly with nested cross-validation (5-fold CV, multiple metrics)
  6. Select the winner based on your success criteria (not just accuracy)
  7. Train once on all training data, evaluate on held-out test set
  8. Serialize the pipeline (joblib) including preprocessing
  9. Build an API that wraps your model with proper schemas and error handling
  10. Test everything (preprocessing, predictions, edge cases, fairness)
  11. Document for reproduction (README, requirements, config, decision log)
  12. Monitor in production (track performance, detect data drift, plan retraining)

This isn't just a machine learning project; it's a software engineering project with a machine learning component. The ML part (training a model) is maybe 20% of the work. The other 80% is infrastructure, testing, documentation, and thinking clearly about what success looks like.

You'll notice that the steps are sequential but also iterative. You might discover during EDA that your original success metrics need adjusting. You might find that your engineered features don't help, so you engineer different ones. You might start with Logistic Regression, realize it's too simple, then try ensemble methods. This iteration is normal. What matters is that you iterate systematically, with evidence, not randomly or by intuition alone.

A practical tip: set up your project early. Create the directory structure, initialize version control, write a README, and set up a requirements.txt file before you write a single line of analysis code. It takes 10 minutes and saves hours of organization headaches later. Use consistent naming conventions (snake_case for files and variables, PascalCase for classes). Add docstrings to every function explaining inputs, outputs, and assumptions. Future you will be grateful.
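
Scaffolding the structure takes a few lines of Python. This sketch follows the directory layout shown earlier; adapt the names to your own conventions:

```python
from pathlib import Path

def scaffold_project(root: str) -> None:
    """Create the standard project layout with placeholder files."""
    base = Path(root)
    for d in ["data/raw", "data/processed", "models",
              "notebooks", "src", "tests"]:
        (base / d).mkdir(parents=True, exist_ok=True)
    for f in ["README.md", "requirements.txt", "config.yaml"]:
        (base / f).touch(exist_ok=True)

# scaffold_project("churn-prediction")  # creates the layout under the current directory
```

Run it once at project start, then `git init` and commit the empty structure so every later artifact has an obvious home.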

Why This Approach Matters

Reproducibility: Your entire workflow is documented and repeatable. A year from now, you can rebuild the model from scratch. You have the raw data, the preprocessing code, the model weights, and the test cases. Every decision is auditable. This is crucial when your model makes decisions that affect people. If someone asks "Why did you make this prediction?" you can trace through your exact procedure and explain it.

Rigorous Comparison: Stratified splits, proper cross-validation, and held-out test sets prevent data leakage and overoptimism. You're not overfitting and fooling yourself into thinking your model works better than it does. Comparing models under identical conditions (same train/test split, same cross-validation scheme, same metrics) means you're measuring real differences, not artifacts of your evaluation procedure.

Interpretability: You know why you chose GradientBoosting and what drove decisions. You can explain predictions to stakeholders. You engineered features with domain knowledge, so you understand what the model is learning. This matters for trust. When you tell the business "this customer is 78% likely to churn because they're new, have low tenure, and low engagement," you can point to real features and explain the reasoning.

Deployability: Your API is tested, validated, and ready for production. No surprises when users call it. You've thought through edge cases, error handling, and schema validation. You have a health check endpoint so monitoring systems can detect when your API goes down.

Maintainability: Tests catch regressions if someone modifies code later. Documentation explains why things work. Others can modify and extend your code without breaking it. That's the difference between a one-off analysis and a proper system. It's the difference between something that solves a problem today and something that creates value for years.

Real-World Complications You'll Encounter

The project we've walked through is idealized. It assumes clean data, balanced classes, and stable behavior. Real projects are messier. Let's talk about what you'll actually face.

One critical piece we haven't discussed: versioning. As you iterate, you'll create multiple model versions (model_v1, model_v2, etc.). Each version has different hyperparameters, different feature engineering choices, potentially different training data. You need to track which version performs best and which is currently in production. Some practitioners use model registries like MLflow to manage this. Others maintain a simple spreadsheet. Regardless of the tool, the discipline matters. Being able to answer "which model is deployed right now?" should never require digging through your code.
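
A lightweight version of this discipline is to write a metadata file next to every serialized model. The schema below is illustrative, not a standard; a model registry like MLflow replaces it at scale:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_model_metadata(model_path: str, version: str,
                        metrics: dict, notes: str = "") -> str:
    """Write a JSON sidecar describing a serialized model."""
    meta = {
        "model_path": model_path,
        "version": version,
        "metrics": metrics,  # e.g. {"recall": 0.798, "roc_auc": 0.891}
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }
    meta_path = str(Path(model_path).with_suffix(".meta.json"))
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta_path
```

With a sidecar per version, "which model is deployed right now?" is answered by reading one small file rather than by archaeology.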

Your data will have issues. Missing values will appear in unexpected places. Outliers will exist. Categorical variables will have thousands of unique values instead of three. Column names will change between data exports. You'll spend more time cleaning and validating data than you expect. Get comfortable with this. Data wrangling is 80% of real ML work.

Classes will be imbalanced. Your churn rate might be 1% instead of 27%. When you have 10,000 non-churners and 100 churners, a model that always predicts "no churn" gets 99% accuracy while being useless. Techniques like SMOTE (Synthetic Minority Oversampling Technique), class weights, and threshold adjustment become essential. You'll learn these techniques in more advanced courses.

Your model will drift. Six months after deployment, performance degrades. Why? Customer behavior changed. The market shifted. Seasonality effects you didn't account for emerged. This is data drift and model drift, and it's why monitoring production models is critical. Build logging into your API from day one.
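
A simple drift check you can log from day one is the population stability index (PSI) between a feature's training distribution and its recent production values; values above roughly 0.2 are conventionally treated as a warning sign. A minimal sketch:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two 1-D samples.

    Bins come from the expected (training) distribution; a small
    epsilon avoids division by zero in empty bins.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6
    e_pct, a_pct = e_pct + eps, a_pct + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_ages = rng.normal(40, 10, 5000)
same_dist = rng.normal(40, 10, 5000)
shifted = rng.normal(50, 10, 5000)  # the population got older

print(psi(train_ages, same_dist))  # near zero: stable
print(psi(train_ages, shifted))    # large: drift, investigate
```

Computing this weekly per feature, from the inputs your API already logs, is often enough to catch drift months before anyone notices degraded predictions.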

Stakeholders will ask for things your model can't do. "Can you predict churn three months out instead of one month?" Sure, but the accuracy drops because it's harder. "Can you explain why John's churn probability is 72%?" You can approximate explanations using SHAP values or feature importance, but true interpretability is hard with complex models. These aren't failures of your model; they're limitations of the problem.

You'll face tradeoffs with no perfect answer. Should you optimize for recall (catch more churners) or precision (avoid wasting outreach on non-churners)? Should you retrain the model weekly or monthly? Should you use a simpler model that runs fast but has lower accuracy, or a complex model that's more accurate but slow? These are business decisions, not technical ones. Your job is to quantify the tradeoffs so decision-makers can choose.

Measuring Success in Production

Once your model is live, success metrics change. In development, you optimized for recall (catching churners). In production, you measure business impact: Did predictions lead to successful interventions? What's the retention lift? What's the cost per prevented churn?

You'll also monitor for data drift and model degradation. Set up automated alerts: if your model's accuracy drops below a threshold, if prediction latency exceeds an SLA, or if the distribution of inputs changes dramatically. Many teams neglect monitoring and are shocked when a model performs badly for weeks before anyone notices. Build observability into your system from day one. Your future self will thank you.

What's Next

You've built a complete ML system. You have a model in production, an API serving predictions, tests ensuring correctness, and documentation that lets others reproduce your work. This is what production machine learning looks like.

The gap between "I trained a model in a notebook" and "I shipped a model to production" is the gap we've closed in this article. You've learned that the real skill isn't in the algorithms; those are tools. The real skill is in thinking rigorously about problems, managing data carefully, comparing fairly, and building systems that work. These principles apply whether you're predicting churn, classifying images, or estimating prices.

Now it's time for deeper learning. The next cluster introduces neural networks and deep learning, where you'll apply these same principles to problems that require millions of parameters and weeks of training. But the fundamentals don't change. You'll still define success metrics, still split data carefully, still test your code. You'll just do it with more complex models.

You've completed the ML Foundations cluster. You know how to think like a machine learning engineer: define problems rigorously, explore data scientifically, compare fairly, and ship code that works. Everything else builds on this foundation.
