scikit-learn Pipelines and Column Transformers

There's a quiet bug that hides in machine learning projects for weeks before it surfaces, sometimes right before a demo, sometimes only after deployment. You built something that worked beautifully in your notebook: clean data, solid accuracy, consistent results across validation sets. Then you ship it. Maybe you retrain on fresh data and deploy again. And slowly, or sometimes suddenly, your metrics degrade. The model that performed so well in development starts misbehaving in production, and nobody can figure out why.
The culprit, more often than you'd expect, is inconsistency between how you preprocessed data during training versus how you preprocess it when making predictions. You scaled your features one way at training time and forgot to apply that exact same transformation when new data arrived. Or you accidentally let information about your test set "leak" into the transformation parameters you computed during model building. These aren't exotic bugs; they're easy mistakes to make, and they're the exact reason scikit-learn Pipelines exist.
A Pipeline is not just a convenience wrapper. It's a contract: a declaration that these preprocessing steps and this model are a single indivisible unit. When you train that unit, every transformation learns from training data only. When you make predictions, those same transformations, with their exact fitted parameters, are applied automatically and consistently. The bug described above becomes structurally impossible to introduce.
And when your data mixes types (numeric columns that need scaling alongside categorical columns that need encoding), ColumnTransformer extends this contract. Instead of managing two separate transformation workflows and hoping you apply them correctly and in the right order, you declare them together. One object handles all of it, fusing the results into a unified feature matrix ready for your model.
In this article, we'll build that understanding from the ground up. We'll start with the data leakage problem that motivates Pipelines, learn how to construct them from simple to complex, explore ColumnTransformer patterns for mixed data, write custom transformers for when built-in options fall short, and cover the mistakes that trip up even experienced practitioners. By the end, you'll have both the conceptual foundation and the practical patterns you need to build reproducible, production-safe machine learning workflows.
Table of Contents
- Why Pipelines Matter: The Data Leakage Problem
- Why Pipelines Prevent Data Leakage
- Building Your First Pipeline
- Multiple Transformations: ColumnTransformer
- Why Separate by Column? Data Leakage, Again
- Column Transformer Patterns
- Make it Shorter: make_column_transformer
- Custom Transformations with FunctionTransformer
- Custom Transformers
- Pandas Output: Keeping Column Names
- Serialization: Save and Load Pipelines
- Hyperparameter Tuning Through Pipelines
- Common Pipeline Mistakes
- Debugging and Introspection
- Practical Example: End-to-End
- Advanced Patterns: Pipelines in Real Workflows
- Parallel Processing with n_jobs
- Pipeline Cloning and Cross-Validation
- Conditional Pipelines with ColumnTransformer Remainder
- Feature Selection Inside Pipelines
- Performance: When to Use Pipelines vs. Manual Preprocessing
- Versioning and Reproducibility
- Real-World Gotcha: OneHotEncoder and Unknown Categories
- Exporting Pipelines to Production Formats
- Conclusion
Why Pipelines Matter: The Data Leakage Problem
Before we talk about solutions, let's look at what goes wrong without them.
Imagine you're working with a dataset of house prices. You have 1,000 rows. You split 80/20: 800 train, 200 test. Then you scale your features using StandardScaler.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.random.randn(1000, 5)
y = np.random.randn(1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# ❌ WRONG? Scale after splitting:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train
X_test_scaled = scaler.transform(X_test)        # Transform test (good!)

Wait, that looks right. Let's dig deeper.
The real problem emerges when you fit the scaler on the entire combined dataset before splitting. This is the mistake that shows up most often in beginner and intermediate code, and it's subtle enough that experienced practitioners have shipped it to production more times than anyone likes to admit.
# ❌ REAL PROBLEM: Fit scaler on full data, then split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on ALL 1000 rows (includes test info!)
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2
)

Now your test set has been "seen" by the scaler. The mean and std of your test data influenced the scaling parameters. Your test set metrics will be artificially optimistic: you've leaked information from test to training.
This leak is subtle but deadly in production. You retrain monthly, apply the new scaler to last month's data, and your metrics drop. You didn't do anything "wrong"; you did something different. You were inconsistent.
Pipelines solve this by locking the transformation parameters at training time and reusing them for prediction.
Why Pipelines Prevent Data Leakage
This deserves its own focused discussion, because the mechanism is worth understanding precisely, not just trusting that it works.
When you call pipeline.fit(X_train, y_train), scikit-learn executes each step sequentially on the training data only. The StandardScaler learns the mean and standard deviation from X_train. It never sees X_test. The parameters it learns, scaler.mean_ and scaler.scale_, are now fixed. They are a property of the training distribution, not the full dataset.
When you later call pipeline.predict(X_test) or pipeline.transform(X_test), the scaler applies those frozen training-time parameters to your test data. The test data gets transformed using the training distribution's statistics. This is exactly what you want: your test set is being treated as if it were genuinely new, unseen data, because from the scaler's perspective, it is.
The leakage that happens outside of Pipelines occurs because nothing enforces that separation. You can call scaler.fit(X_full) before splitting, and Python won't warn you. You can forget to apply the scaler at all during prediction, and your code will still run and produce numbers. Pipelines make the correct behavior the default behavior. The structure of the code prevents the error at the architectural level rather than relying on you to remember it correctly every time.
This matters enormously during cross-validation. When GridSearchCV splits your data into folds, it clones your pipeline fresh for each fold. Each fold's preprocessor fits only on that fold's training portion. If you used manual preprocessing before passing data to GridSearchCV, your preprocessing would already have seen all folds, your validation scores would be optimistically biased, and you would likely choose a worse model while believing you chose a better one. Pipelines ensure your cross-validation estimates are honest.
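The difference is easy to demonstrate. Below is a minimal sketch on synthetic data (make_classification is used purely for illustration): scaling the full matrix before cross-validation versus letting a pipeline refit the scaler inside each fold. The two mean scores are often close on well-behaved synthetic data, but only the pipeline version is structurally honest.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Leaky: the scaler sees every row before cross-validation ever splits the data
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Honest: a fresh scaler is fitted inside each fold's training portion only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky:  {leaky_scores.mean():.3f}")
print(f"honest: {honest_scores.mean():.3f}")
```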
Building Your First Pipeline
The simplest way to create a Pipeline is make_pipeline. It requires no step naming (scikit-learn generates names automatically from the class names) and reads cleanly from left to right, reflecting the actual order of operations in your workflow.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Create a pipeline: Scale → Classify
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Train (fit and transform happen automatically)
pipe.fit(X, y)
# Predict (transform automatically applied)
predictions = pipe.predict(X[:10])
print(predictions)

That one-line pipeline definition encapsulates a workflow that used to require three to five separate statements, each with its own opportunity for error or inconsistency.
Behind the scenes, here's what happens:
- fit(X, y): StandardScaler fits on X and learns the mean and std. Then LogisticRegression fits on the scaled X.
- predict(X_new): StandardScaler transforms X_new using the parameters learned during fit. LogisticRegression predicts on the transformed data.
The scaler is locked in. You can't accidentally apply a different scaler in production.
Once your pipeline is fitted, you can inspect what it learned. This is useful for debugging and for understanding whether your transformations behaved as expected on your particular dataset.
# Get the fitted scaler from the pipeline
scaler = pipe.named_steps['standardscaler']
print(scaler.mean_)
print(scaler.scale_)

Ah, but this uses generic names. Better to use Pipeline and give steps explicit names. Explicit names make your code self-documenting and make debugging much easier when you have five or more steps in a complex pipeline.
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
pipe.fit(X, y)
# Now we can access by name
scaler = pipe.named_steps['scaler']
print(f"Scaler mean: {scaler.mean_[:5]}")
print(f"Scaler scale: {scaler.scale_[:5]}")

Named steps are huge for debugging. You can inspect each step, check what it learned, and explore a pipeline that would otherwise be a black box.
Multiple Transformations: ColumnTransformer
Real-world datasets don't have one data type. You might have:
- Numeric columns: age, income, years employed
- Categorical columns: education, job title, region
- Boolean columns: is_customer, has_loan
You need different transformations for each:
- Numeric: StandardScaler, PolynomialFeatures
- Categorical: OneHotEncoder, OrdinalEncoder
- Boolean: Leave as-is (or encode to 0/1)
ColumnTransformer applies different transformers to different column groups, then combines the results. Think of it as a routing layer that sends each column to the right transformation, then assembles all the results into a single feature matrix.
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Sample data
data = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'income': [50000, 75000, 100000, 120000],
    'education': ['HS', 'Bachelors', 'Masters', 'PhD'],
    'employed': [1, 1, 1, 0]
})
# Define column groups
numeric_features = ['age', 'income']
categorical_features = ['education']
# Create transformers for each group
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# Combine them
column_transformer = ColumnTransformer([
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
])
# Use in a pipeline
pipe = Pipeline([
    ('preprocessing', column_transformer),
    ('classifier', LogisticRegression(max_iter=1000))
])
X = data.drop('employed', axis=1)
y = data['employed']
pipe.fit(X, y)
print(pipe.predict(X[:1]))

The indirection here, defining column groups explicitly before building the transformer, pays off when your datasets grow. When you add or remove features, you update the lists, and the transformer structure follows automatically without requiring you to recount column indices.
Let's trace through what happens:
- ColumnTransformer sees the input dataframe.
- It applies StandardScaler to ['age', 'income'].
- It applies OneHotEncoder to ['education'].
- It concatenates the results horizontally (side by side).
- The combined scaled-and-encoded array flows to LogisticRegression.
The beauty? When you call pipe.predict(new_data), the ColumnTransformer applies the exact same transformations it learned during fit. No leakage. No surprises.
Why Separate by Column? Data Leakage, Again
You might think, "Why not just apply StandardScaler to all columns? It'll ignore non-numeric columns anyway."
Actually, no. StandardScaler on mixed types fails:
# ❌ FAILS
scaler = StandardScaler()
scaler.fit(data)  # Error: can't scale strings

But more subtly: what if you try to force numeric encoding?
# ❌ WRONG APPROACH
# Apply OneHotEncoder first to everything
# Then scale everything
# Now the one-hot encoded columns (0/1) get scaled to weird ranges
# And you're mixing two different transformations of the same data

With ColumnTransformer, you're explicit: these columns get this transformation. No guessing. No mixing. Just clean separation of concerns.
The column-level specificity also makes your preprocessing auditable. When a stakeholder asks "how did you handle the education column?", you have an exact, code-level answer rather than a vague description of a sequence of manual steps.
Column Transformer Patterns
There are several patterns worth knowing for ColumnTransformer that come up repeatedly in real projects. Mastering them will cover the majority of preprocessing scenarios you encounter.
The first pattern is nested pipelines within a ColumnTransformer. Sometimes a column group needs multiple transformations applied in sequence, not just one. You handle this by nesting a Pipeline as the transformer for that group. Numeric columns frequently need imputation before scaling (missing values would otherwise break most downstream estimators), so you chain the two together. The outer ColumnTransformer routes to the right group, and the inner Pipeline applies transformations in order within that group.
The second pattern is the remainder option. Every column in your dataframe must be accounted for. By default, ColumnTransformer drops any column not explicitly assigned to a transformer group. This is often the right behavior (you don't want unlisted columns sneaking into your feature matrix), but sometimes you have columns you want to pass through unchanged. Setting remainder='passthrough' includes all unspecified columns at the end of the output matrix without any transformation. This is useful for boolean columns or already-scaled features that need no preprocessing.
The third pattern is using column selectors instead of lists. For large dataframes, listing every column name by hand is error-prone and brittle. Scikit-learn provides make_column_selector to select columns by dtype automatically. make_column_selector(dtype_include=np.number) selects all numeric columns, while make_column_selector(dtype_include=object) selects all object (string) columns. If your column types are set correctly, which they should be as part of your data cleaning process, these selectors update automatically when you add new columns to your dataset.
The fourth pattern is wrapping ColumnTransformer in a function. When you build multiple models for the same dataset, you often want the same preprocessing logic. Wrapping ColumnTransformer construction in a function that accepts feature lists as parameters keeps your preprocessing DRY and lets you experiment with different column assignments without duplicating code.
Make it Shorter: make_column_transformer
For quick pipelines, use make_column_transformer. It removes the need to name each transformer explicitly, letting you declare transformations inline in a compact form that's easy to scan.
from sklearn.compose import make_column_transformer
column_transformer = make_column_transformer(
(StandardScaler(), numeric_features),
(OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
)
pipe = Pipeline([
('preprocessing', column_transformer),
('classifier', LogisticRegression(max_iter=1000))
])Or even faster, combine both into one step with make_pipeline. This ultra-compact form is excellent for prototyping, where you want to get a baseline model running quickly and iterate from there.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
make_column_transformer(
(StandardScaler(), numeric_features),
(OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
),
LogisticRegression(max_iter=1000)
)It's concise, readable, and does exactly what you need.
Custom Transformations with FunctionTransformer
Sometimes you need a transformation that scikit-learn doesn't provide. Extract the month from a date. Compute a ratio of two columns. Apply a custom formula. Real datasets have quirks that generic transformers can't anticipate, and a good pipeline system needs to accommodate them without abandoning its consistency guarantees.
FunctionTransformer wraps your custom function into a transformer that fits the pipeline framework:
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Custom transformation: log of income + 1 (to avoid log(0))
def log_transform(X):
    return np.log1p(X)

log_transformer = FunctionTransformer(log_transform, validate=True)
# Use in pipeline
pipe = Pipeline([
    ('log', log_transformer),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Example
X_sample = np.array([[100], [1000], [10000]])
pipe.fit(X_sample, np.array([0, 1, 1]))
print(pipe.predict([[5000]]))

The validate=True argument checks that input and output are 2D arrays. Your function must accept an array-like input and return an array-like output. It can be as simple as a log transform or as complex as a feature interaction computation; as long as it maps arrays to arrays without learning from data, FunctionTransformer handles it cleanly.
If you need to learn parameters (like the mean for centering), write a custom class inheriting from BaseEstimator and TransformerMixin. This is the correct pattern whenever your transformation needs to fit on training data and then apply those learned parameters to new data: the exact same contract that all scikit-learn transformers follow.
from sklearn.base import BaseEstimator, TransformerMixin
class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, offset=0):
        self.offset = offset

    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)
        return self

    def transform(self, X):
        return X - self.mean_ + self.offset

# Use exactly like any sklearn transformer
pipe = Pipeline([
    ('custom', CustomScaler(offset=100)),
    ('classifier', LogisticRegression(max_iter=1000))
])

Now your custom transformer learns from training data and applies consistent transformations to new data.
Custom Transformers
Custom transformers deserve a fuller treatment, because they're the key to making Pipelines practical for real-world preprocessing rather than just textbook examples.
The BaseEstimator and TransformerMixin combination is the standard recipe. BaseEstimator gives you get_params() and set_params() for free, which means your custom transformer works with GridSearchCV and clone() out of the box. TransformerMixin gives you fit_transform() for free by combining fit() and transform(). You only need to implement three methods: __init__, fit, and transform.
One rule that catches people: store all constructor arguments as instance attributes with the exact same names in __init__. This is required for get_params() to work correctly, which in turn is required for hyperparameter search. If you write self.thresh = threshold when your parameter is named threshold, GridSearchCV will silently fail to tune it. Write self.threshold = threshold and it works.
Another rule: fit must return self. This is what enables the method chaining pattern transformer.fit(X).transform(X), and it's required for fit_transform to work. Forgetting to return self produces confusing NoneType has no attribute transform errors that can take a while to trace back to their source.
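Both rules together, in a small hypothetical transformer (ClipOutliers is an illustrative name, not a scikit-learn class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip values beyond `threshold` standard deviations from the mean."""

    def __init__(self, threshold=3.0):
        # Rule 1: store under the SAME name as the constructor argument
        self.threshold = threshold

    def fit(self, X, y=None):
        # Learned state carries a trailing underscore, sklearn-style
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self  # Rule 2: fit returns self

    def transform(self, X):
        lo = self.mean_ - self.threshold * self.std_
        hi = self.mean_ + self.threshold * self.std_
        return np.clip(X, lo, hi)

clipper = ClipOutliers(threshold=2.0)
print(clipper.get_params())  # {'threshold': 2.0} -- so GridSearchCV can tune it
```

Because `threshold` round-trips through get_params()/set_params(), a grid over 'clipoutliers__threshold' works inside any pipeline containing this step.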
Custom transformers also support inverse_transform if your transformation is reversible and you want to be able to decode predictions back to the original scale. This is particularly useful in regression pipelines where your target was log-transformed and you want to report predictions in the original units.
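For simple reversible functions, FunctionTransformer supports this directly through its inverse_func argument. A sketch with the log1p/expm1 pair:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A reversible transform: log1p forward, expm1 back
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1,
                                      validate=True)

prices = np.array([[100.0], [1000.0], [10000.0]])
logged = log_transformer.fit_transform(prices)
restored = log_transformer.inverse_transform(logged)

print(np.allclose(restored, prices))  # True
```

For log-transformed regression targets specifically, sklearn.compose.TransformedTargetRegressor wraps the same func/inverse_func pair around an estimator, so predictions come back in original units automatically.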
When your custom transformer works on specific columns of a DataFrame, extracting a date component, for instance, use ColumnTransformer to route those columns to your transformer while leaving others unaffected. This keeps your custom logic isolated and your preprocessing architecture clean.
Pandas Output: Keeping Column Names
By default, transformers output numpy arrays. You lose column names. In pandas workflows, that's painful. When you're debugging or doing feature importance analysis, anonymous array columns make it much harder to understand what your model actually learned.
Scikit-learn 1.2+ introduced set_output(transform='pandas'):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
column_transformer = ColumnTransformer([
    ('numeric', StandardScaler(), numeric_features),
    ('categorical', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_features)
])
# Output dataframes, not arrays
column_transformer.set_output(transform='pandas')
X_transformed = column_transformer.fit_transform(X)
print(type(X_transformed))  # <class 'pandas.core.frame.DataFrame'>
print(X_transformed.columns)  # See your column names!

The transformed output keeps pandas column names (generated by OneHotEncoder from the original categorical values). You can still use the pipeline normally; it just works with dataframes internally. This is one of the most practical quality-of-life improvements in recent scikit-learn versions, and it's worth adopting immediately if you're on 1.2 or later.
Serialization: Save and Load Pipelines
You've trained a pipeline. Now you need to use it in production. You could retrain every time, but that's wasteful. Better to serialize it. Serialization is also what enables you to separate model training from model serving, which is a fundamental pattern in production ML systems.
Use joblib (included with scikit-learn):
import joblib
# Train
pipe.fit(X_train, y_train)
# Save
joblib.dump(pipe, 'my_pipeline.pkl')
# Load in production
loaded_pipe = joblib.load('my_pipeline.pkl')
# Predict
predictions = loaded_pipe.predict(X_new)

joblib handles all the numpy arrays, sklearn objects, and learned parameters. It's robust and fast.
One caveat: if your pipeline includes FunctionTransformer with a lambda or a locally-defined function, joblib might fail to serialize it. Stick to module-level functions or custom classes. This is a common gotcha when you write a quick pipeline in a notebook: lambdas are convenient interactively but will break when you try to save and reload the model later.
# ❌ Joblib can't serialize this
transformer = FunctionTransformer(lambda x: x ** 2)

# ✅ Joblib can serialize this
def my_square(x):
    return x ** 2

transformer = FunctionTransformer(my_square)

Hyperparameter Tuning Through Pipelines
Remember that __ syntax in GridSearchCV? It's made for pipelines.
When you have a Pipeline, every step's parameters are accessible with step_name__parameter_name. This double-underscore convention lets GridSearchCV navigate into nested objects, into the pipeline, into a specific step, and then to the parameter of that step. It's verbose but explicit, and it enables you to tune not just your model's hyperparameters but your preprocessing hyperparameters at the same time, searching for the combination that works best on your actual data.
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])
param_grid = {
    'poly__degree': [1, 2, 3],
    'ridge__alpha': [0.1, 1.0, 10.0]
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")

GridSearchCV tries all combinations:
- PolynomialFeatures(degree=1) with Ridge(alpha=0.1)
- PolynomialFeatures(degree=1) with Ridge(alpha=1.0)
- ... and so on.
Each fold:
- Trains PolynomialFeatures and Ridge on the training fold.
- Predicts on the validation fold.
- Computes the score.
The pipeline ensures transformations fit only on training data, never on validation data. No leakage across folds. This is the point where the earlier investment in building a proper pipeline pays off in full: you can tune aggressively, exploring many combinations, and trust that your validation scores are honest estimates of generalization performance rather than artifacts of leakage.
When you're done tuning, grid.best_estimator_ is your fully-trained, tuned pipeline. Use it directly:
final_predictions = grid.best_estimator_.predict(X_test)

Common Pipeline Mistakes
Even when you know Pipelines well, specific patterns trip people up. These are the mistakes we see most often, collected from real code reviews and debugging sessions.
The first and most common mistake is putting preprocessing outside the pipeline. You apply imputation or encoding to your whole dataset, then pass the result into a Pipeline. At that point you've already done the transformation manually, and the pipeline offers you nothing except false confidence that you're protected from leakage when you're actually not. All preprocessing that uses training data statistics must live inside the Pipeline, or you lose the leakage protection entirely.
The second mistake is misusing fit_transform versus fit plus transform. On training data, fit_transform is correct and slightly more efficient than calling both separately. On test data or new data, you must call transform only. If you call fit_transform on your test set, you refit the scaler on the test distribution; that's leakage. Pipelines handle this automatically when you use pipeline.fit and pipeline.predict, but if you call steps manually or extract transformers from the pipeline, you need to be careful about which method you call.
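A sketch of the distinction on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn stats AND transform: train only
X_test_scaled = scaler.transform(X_test)        # reuse training stats: no refit

# ❌ scaler.fit_transform(X_test) would refit on the test distribution -- leakage
print(np.allclose(X_train_scaled.mean(axis=0), 0.0, atol=1e-7))  # True
```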
The third mistake is forgetting handle_unknown='ignore' on OneHotEncoder. During training, your encoder learns a fixed vocabulary of categories. If a category appears at prediction time that wasn't in the training set (a new product line, a new region, a new customer tier), the default encoder will raise an error and crash your serving code. Production data is unpredictable. Always set handle_unknown='ignore' unless you have absolute certainty that your category sets are closed and can never receive new values.
The fourth mistake is using lambda functions in FunctionTransformer when you plan to serialize the pipeline. As noted earlier, lambdas can't be pickled by joblib. Define named functions at module level, or use a custom class. This mistake is especially common when you prototype in a notebook, where lambdas feel natural, and then try to move that code to a production service.
The fifth mistake is not testing your serialized pipeline on realistic data before deploying. Serialize it, reload it fresh, run it on held-out data, and verify that predictions match what the original fitted pipeline produces. This catches subtle issues like version mismatches or serialization failures before they surface in production at the worst possible moment.
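A minimal version of that check (the filename is arbitrary):

```python
import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

joblib.dump(pipe, 'pipeline_check.pkl')       # serialize
reloaded = joblib.load('pipeline_check.pkl')  # reload fresh

# The reloaded pipeline must reproduce the original predictions exactly
match = np.array_equal(pipe.predict(X), reloaded.predict(X))
print(match)  # True
```

In a real project, run this check against held-out data in CI so a version mismatch fails the build instead of the deployment.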
Debugging and Introspection
Pipelines are "black boxes" by default: you fit and predict, but what's happening inside?
Get feature names after transformation. This is especially valuable with OneHotEncoder, where the output column names carry semantic meaning about which category they represent:
column_transformer = ColumnTransformer([
    ('numeric', StandardScaler(), numeric_features),
    ('categorical', OneHotEncoder(sparse_output=False), categorical_features)
])
column_transformer.set_output(transform='pandas')
column_transformer.fit(X)
# Get feature names
feature_names = column_transformer.get_feature_names_out()
print(feature_names)

Access intermediate transformations. If your pipeline produces unexpected results, isolating each step's output is the fastest path to the source of the problem:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
# Get the fitted scaler
scaler = pipe.named_steps['scaler']
print(f"Feature means: {scaler.mean_}")
print(f"Feature scales: {scaler.scale_}")
# Get the fitted classifier
classifier = pipe.named_steps['classifier']
print(f"Coefficients: {classifier.coef_}")

Print the whole pipeline to get a summary of its structure:
print(pipe)Output:
Pipeline(steps=[('scaler', StandardScaler()),
('classifier', LogisticRegression())])
For ColumnTransformers with many steps, use verbose=True during fit to see what's happening:
column_transformer = ColumnTransformer([...], verbose=True)
column_transformer.fit(X)
# Prints a progress line for each transformer as it fits

Practical Example: End-to-End
Let's tie it all together with a realistic scenario. This example uses a small Titanic-like dataset to demonstrate every major pipeline concept in combination: nested preprocessing pipelines, missing value handling, categorical encoding, hyperparameter search, and serialization.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import joblib
# Load data (Titanic as example)
X = pd.DataFrame({
    'age': [25, 35, np.nan, 55, 30, 40, np.nan, 22, 60, 48],
    'fare': [7.25, 71.28, 7.92, 51.86, 8.05, 30.0, 12.5, 9.5, 80.0, 26.55],
    'sex': ['male', 'female', 'female', 'male', 'male',
            'female', 'male', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'S', 'S', 'Q', 'C', 'S', 'S', 'C', 'Q']
})
y = pd.Series([0, 1, 1, 1, 0, 1, 0, 0, 1, 0])
# Split (stratify so both classes appear in every fold)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Define columns
numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']
# Preprocessing: impute + scale numeric, one-hot encode categorical
column_transformer = ColumnTransformer([
    ('numeric', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_features),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
# Full pipeline
pipe = Pipeline([
    ('preprocessing', column_transformer),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=42))
])
# Tune hyperparameters
param_grid = {
    'classifier__max_depth': [3, 5, None],
    'classifier__min_samples_split': [2, 5]
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
# Evaluate
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.best_estimator_.score(X_test, y_test):.3f}")
# Save for production
joblib.dump(grid.best_estimator_, 'titanic_model.pkl')

Notice:
- We handle missing values inside the numeric pipeline with SimpleImputer.
- Categorical features are one-hot encoded.
- No data leakage: train/test are separated before preprocessing.
- Hyperparameter tuning uses the __ syntax.
- The best model is serialized and ready to deploy.
When new data arrives in production, you load the pipeline and predict. The entire preprocessing chain (imputation, scaling, encoding) is already baked in and ready to apply consistently to whatever raw data you feed it:
model = joblib.load('titanic_model.pkl')
new_passenger = pd.DataFrame({
    'age': [28],
    'fare': [50],
    'sex': ['female'],
    'embarked': ['S']
})
print(model.predict(new_passenger))

Same transformations. Same model. Consistent results.
Advanced Patterns: Pipelines in Real Workflows
Beyond basics, here's where pipelines shine in production systems.
Parallel Processing with n_jobs
ColumnTransformers and Pipelines support parallel processing. If you have multiple columns or transformations, let scikit-learn use multiple CPU cores:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
column_transformer = ColumnTransformer([
    ('numeric', StandardScaler(), numeric_features),
    ('categorical', OneHotEncoder(sparse_output=False), categorical_features),
], n_jobs=-1)  # Use all available cores
column_transformer.fit(X_train)
X_transformed = column_transformer.transform(X_test)
The -1 means "use all cores." For large datasets with several transformers, this can cut fit time significantly.
Pipeline Cloning and Cross-Validation
Under the hood, GridSearchCV clones your pipeline for each fold. This ensures each fold has fresh estimators without learned parameters from other folds.
You can clone manually too:
from sklearn.base import clone
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)  # Fit the original first
# Clone (create a fresh, unfitted copy)
pipe_copy = clone(pipe)
# The original stays fitted; the clone does not
print(pipe.named_steps['scaler'].mean_)  # Fitted (has mean_)
print(hasattr(pipe_copy.named_steps['scaler'], 'mean_'))  # False (unfitted)
This is useful when you need multiple independent runs or custom cross-validation logic.
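As a concrete use, here's a sketch of a manual cross-validation loop that clones the pipeline once per fold, so every fold's scaler is fitted on that fold's training split only. The synthetic data stands in for a real X_train/y_train:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for this fold
    fold_pipe.fit(X[train_idx], y[train_idx])  # scaler sees only this fold's training split
    scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

print(f"Mean CV accuracy: {np.mean(scores):.3f}")
```

This is essentially what GridSearchCV does internally; writing it by hand is only worth it when you need fold-level control it doesn't offer.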
Conditional Pipelines with ColumnTransformer Remainder
What if you have columns you don't want to transform at all?
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer([
    ('numeric', StandardScaler(), numeric_features),
    ('categorical', OneHotEncoder(sparse_output=False), categorical_features),
], remainder='passthrough')  # Keep untransformed columns as-is
With remainder='passthrough', any column not explicitly mentioned is passed through unchanged. With remainder='drop' (the default), unmapped columns are discarded.
Feature Selection Inside Pipelines
You can embed feature selection into your pipeline:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('preprocessing', column_transformer),
    ('feature_selection', SelectKBest(f_classif, k=10)),
    ('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
# See which features were selected. The mask indexes the post-preprocessing
# feature space (one-hot encoding expands columns), not the original columns.
selected_mask = pipe.named_steps['feature_selection'].get_support()
all_features = pipe.named_steps['preprocessing'].get_feature_names_out()
selected_features = all_features[selected_mask]
print(f"Selected features: {selected_features}")
Now hyperparameter tuning can also search over k:
param_grid = {
    'feature_selection__k': [5, 10, 15, 20],
    'classifier__C': [0.1, 1.0, 10.0]
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
Performance: When to Use Pipelines vs. Manual Preprocessing
Use pipelines when:
- You're training multiple models (consistency matters).
- You'll deploy to production (reproducibility matters).
- You're tuning hyperparameters (cross-validation needs separation).
- Your preprocessing is complex (multiple steps, multiple types).
- You want serialization (save and load the whole workflow).
Manual preprocessing is okay when:
- You're in exploratory data analysis mode (trying things, discarding them).
- You're working with a tiny dataset (efficiency doesn't matter).
- Your preprocessing is trivial (one scaler, done).
In practice? Use pipelines. The small upfront cost pays off immediately.
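To see how small that upfront cost is, here's a minimal sketch on synthetic data: the manual and pipeline versions of the same scale-then-classify workflow produce identical predictions, but the pipeline is one object to fit, save, and call instead of two to keep in sync:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X.sum(axis=1) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual: two objects, and you must remember transform() at predict time
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_preds = clf.predict(scaler.transform(X_test))

# Pipeline: one object, one fit, one predict
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)

print((manual_preds == pipe_preds).all())  # True -- same math, less surface for mistakes
```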
Versioning and Reproducibility
Pipelines are great for reproducibility, but you need discipline:
- Pin dependency versions: scikit-learn 1.0 vs 1.2 might produce slightly different results.
pip install scikit-learn==1.3.2
- Pin random seeds: If any step uses randomness (e.g., randomized PCA), set random_state:
pipe = Pipeline([
    ('pca', PCA(n_components=10, random_state=42)),
    ('classifier', LogisticRegression(random_state=42))
])
- Document your pipeline:
"""
Production ML pipeline for customer churn prediction.
Steps:
1. Numeric scaling: StandardScaler on age, income, usage.
2. Categorical encoding: OneHotEncoder on job_title, region.
3. Classification: LogisticRegression with C=1.0.
Trained on 50k customer records, 2025-02-25.
Target metric: AUC-ROC 0.87 on holdout test set.
"""
pipe = Pipeline(...)
- Test your serialization:
# After fitting
joblib.dump(pipe, 'model.pkl')
# Reload and verify
reloaded_pipe = joblib.load('model.pkl')
# Predictions should match
assert np.allclose(
    pipe.predict_proba(X_test),
    reloaded_pipe.predict_proba(X_test)
)
Real-World Gotcha: OneHotEncoder and Unknown Categories
When you one-hot encode during training, you learn which categories exist. At prediction time, if a new category appears, what happens?
# Training: categories are ['male', 'female']
encoder = OneHotEncoder(handle_unknown='error') # Default
encoder.fit(X_train[['sex']])
# Prediction: new data has 'other'
# This FAILS
encoder.transform(X_test[['sex']])  # Error!
Use handle_unknown='ignore':
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(X_train[['sex']])
# New categories are ignored (set to 0)
result = encoder.transform(X_test[['sex']])
Now unknown categories become all-zeros, which your classifier treats as a "null" encoding. It's not perfect, but it's production-safe.
Exporting Pipelines to Production Formats
Joblib is great for Python-only systems. But what if your production environment is Node.js? Java? Go?
Options:
- ONNX (Open Neural Network Exchange)
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 4]))]
onnx_model = convert_sklearn(pipe, initial_types=initial_type)
# Save as ONNX
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
Now any language with ONNX support can load and run your pipeline.
- REST API wrapper
from flask import Flask, request, jsonify
import joblib
import pandas as pd
app = Flask(__name__)
pipe = joblib.load('model.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    # Rebuild a one-row DataFrame so column names match what the pipeline saw at fit time
    features = pd.DataFrame([{'age': data['age'], 'income': data['income']}])
    prediction = pipe.predict(features)
    return jsonify({'prediction': int(prediction[0])})
Deploy the Flask app. Other languages call your HTTP endpoint.
- Containerization
FROM python:3.10
COPY model.pkl .
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
Docker handles all dependencies and deployment.
Conclusion
Pipelines are one of those tools that, once you start using them properly, you wonder how you ever wrote machine learning code without them. The answer, usually, is that you wrote code that had subtle bugs, inconsistent preprocessing, leaky test sets, production models that behaved differently from their development counterparts, and you either caught those bugs eventually, or you didn't.
The mental shift that Pipelines enable is treating your preprocessing and your model as a single, atomic artifact rather than a sequence of steps you must apply correctly each time. You build the Pipeline once. You fit it once. Then you serialize it, ship it, load it in production, and call predict. Everything stays consistent because consistency is enforced by structure rather than by memory and discipline.
ColumnTransformer extends this to the reality of mixed-type tabular data, where numeric and categorical columns require fundamentally different treatments. Instead of managing two preprocessing workflows and manually concatenating their outputs, you declare the routing once, and the framework handles the rest. Custom Transformers let you extend this system to any preprocessing logic your dataset requires, following the same fit/transform contract that makes the whole ecosystem composable.
The patterns in this article, nested pipelines for multi-step column groups, set_output for pandas-aware workflows, the double-underscore syntax for hyperparameter search, module-level functions for serializable transformers, handle_unknown='ignore' for production safety, are the patterns that show up in real production ML systems. They're worth learning not as trivia but as the vocabulary of reliable machine learning engineering.
Use Pipelines. Your future self, and your production system, will thank you.