Machine Learning Pipeline

A Pipeline chains multiple steps together so that the output of each step becomes the input of the next step.

Pipelines make it easy to apply exactly the same preprocessing to both the training data and the test data.
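
As a minimal sketch of that idea, the snippet below fits a two-step pipeline on a tiny made-up dataset (the arrays and step names are assumptions for illustration, not part of the original post). The scaler learns its minimum and maximum from the training data only, and the same learned values are reused when predicting on the test data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature and a binary target.
X_train = np.array([[0.0], [5.0], [10.0], [15.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[20.0]])

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# fit() learns the scaler's min/max from the training data only.
pipe.fit(X_train, y_train)

# The fitted scaler reuses the training min/max on the test data,
# so a test value above the training maximum scales to more than 1.
scaled_test = pipe.named_steps["scale"].transform(X_test)
y_pred = pipe.predict(X_test)
```

Because the test value 20 is scaled with the training range 0 to 15, it ends up above 1, which shows that nothing was re-learned from the test set.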

Without a pipeline, each step is defined separately:


1. Applying Imputation
  • Imputation means filling the missing (null) values in the data, here with the SimpleImputer in sklearn.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

trf1 = ColumnTransformer([
    # Index 1 is the second column (indices start at 0); the default strategy is the mean
    ("impute", SimpleImputer(), [1]),
    # Fill a categorical column with its most frequent value
    ("impute_cat", SimpleImputer(strategy="most_frequent"), [6]),
], remainder="passthrough")
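
To see what the two imputers do on their own, here is a small sketch on made-up arrays (the values and variable names are assumptions for illustration). The numeric imputer uses the default mean strategy, and the categorical one uses the most frequent value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing value.
X_num = np.array([[1.0], [np.nan], [3.0]])
num_imputer = SimpleImputer()  # default strategy="mean"
X_num_filled = num_imputer.fit_transform(X_num)

# Hypothetical categorical column with one missing value.
X_cat = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)
cat_imputer = SimpleImputer(strategy="most_frequent")
X_cat_filled = cat_imputer.fit_transform(X_cat)
```

The missing number is replaced by the mean of 1 and 3, and the missing category is replaced by "S", the value that occurs most often.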


2. OneHot Encoding
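  • One-hot encoding means converting the categorical columns to numeric columns, here with the OneHotEncoder in sklearn. The column indices refer to the output of the previous step: a ColumnTransformer places its transformed columns first and the passthrough columns after them.
from sklearn.preprocessing import OneHotEncoder

trf2 = ColumnTransformer([
    # sparse_output replaces the older sparse argument, which was removed in scikit-learn 1.4
    ("encoding", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), [1, 6]),
], remainder="passthrough")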
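
A short sketch of the encoder on its own may help; the toy column below is made up for illustration. To stay independent of the dense-output parameter name, which differs between scikit-learn versions, this example keeps the default sparse result and converts it with toarray().

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
X_cat = np.array([["male"], ["female"], ["male"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X_cat).toarray()

# Learned categories are sorted, so the output columns are: female, male.
categories = encoder.categories_[0].tolist()

# With handle_unknown="ignore", a category not seen during fit
# is encoded as a row of zeros instead of raising an error.
unseen = encoder.transform(np.array([["other"]], dtype=object)).toarray()
```

This is why handle_unknown="ignore" is useful in a pipeline: the test data may contain a category that never appeared in the training data.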


3. Scaling
  • Scaling means rescaling every numeric column to the 0 to 1 range, here with the MinMaxScaler in sklearn.
from sklearn.preprocessing import MinMaxScaler

trf3 = ColumnTransformer([
    ("scale", MinMaxScaler(), slice(0, 8))  # slice(0, 8) selects columns 0 through 7
])
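
The slice behaviour can be checked with a small sketch on a made-up array (the data and the name scaler are assumptions for illustration). As with normal Python slicing, the stop index is excluded, so slice(0, 2) scales only columns 0 and 1.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with three numeric columns.
X = np.array([
    [0.0, 10.0, 100.0],
    [5.0, 20.0, 200.0],
    [10.0, 30.0, 300.0],
])

scaler = ColumnTransformer([
    # Columns 0 and 1 are scaled; the stop index 2 is excluded.
    ("scale", MinMaxScaler(), slice(0, 2)),
], remainder="passthrough")

X_scaled = scaler.fit_transform(X)
```

The first two columns now run from 0 to 1, while the third column passes through unchanged.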


4. Train the model
  • Define the model. Here we use the DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier

trf4 = DecisionTreeClassifier()


With the pipeline


from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("trf1", trf1),
    ("trf2", trf2),
    ("trf3", trf3),
    ("trf4", trf4),
])

# Fit every step on the training data, then predict on the test data
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
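
For a version that runs from start to finish, here is a minimal sketch on a small made-up dataset. The data, the column layout (age, sex, fare, embarked) and the column indices are assumptions for illustration, not the dataset used in this post. It sets sparse_threshold=0 on the encoding step to force dense output, which avoids the OneHotEncoder dense-output parameter whose name depends on the scikit-learn version.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data. Columns: 0=age, 1=sex, 2=fare, 3=embarked.
X_train = np.array([
    [22.0, "male", 7.25, "S"],
    [38.0, "female", 71.28, "C"],
    [np.nan, "female", 7.92, "S"],
    [35.0, "female", 53.10, np.nan],
    [54.0, "male", 51.86, "S"],
    [2.0, "male", 21.07, "C"],
], dtype=object)
y_train = np.array([0, 1, 1, 1, 0, 0])
X_test = np.array([
    [30.0, "female", 10.0, "S"],
    [40.0, "male", 60.0, "C"],
], dtype=object)

# Step 1: impute. Output column order: age, embarked, sex, fare.
trf1 = ColumnTransformer([
    ("impute_num", SimpleImputer(), [0]),
    ("impute_cat", SimpleImputer(strategy="most_frequent"), [3]),
], remainder="passthrough")

# Step 2: encode embarked (now index 1) and sex (now index 2).
# Output: 2 embarked columns, 2 sex columns, then age and fare.
trf2 = ColumnTransformer([
    ("encoding", OneHotEncoder(handle_unknown="ignore"), [1, 2]),
], remainder="passthrough", sparse_threshold=0)

# Step 3: scale all 6 resulting columns.
trf3 = ColumnTransformer([
    ("scale", MinMaxScaler(), slice(0, 6)),
])

pipe = Pipeline([
    ("trf1", trf1),
    ("trf2", trf2),
    ("trf3", trf3),
    ("trf4", DecisionTreeClassifier(random_state=0)),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
train_accuracy = pipe.score(X_train, y_train)
```

The comments track how the column order changes after each ColumnTransformer, since the indices used in a later step always refer to the output of the step before it.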

Alternatively, use the make_pipeline function in sklearn, which names the steps automatically:

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(trf1, trf2, trf3, trf4)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
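
The main difference from Pipeline is that make_pipeline generates the step names itself. The sketch below uses plain estimators rather than the transformers defined above, purely to keep it short, and lists the names it produces.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Step names are the lowercased class names.
simple = make_pipeline(MinMaxScaler(), DecisionTreeClassifier())
simple_names = [name for name, _ in simple.steps]

# When a class appears more than once, a numeric suffix is added.
repeated = make_pipeline(MinMaxScaler(), MinMaxScaler(), DecisionTreeClassifier())
repeated_names = [name for name, _ in repeated.steps]
```

These generated names are what you would use to look up a step, for example through named_steps, when no explicit names were given.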


How to view the full pipeline as a diagram

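from sklearn import set_config

# Render estimators as a diagram when displayed in a notebook
set_config(display="diagram")
pipeline  # evaluating the pipeline in a notebook cell shows the diagram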

This is a basic guide to implementing pipelines in sklearn. A pipeline bundles all the preprocessing steps and the model fitting into a single object, so the whole workflow can be fitted and applied with one call.

Thanks for feedback.
