Machine Learning Pipeline

A pipeline chains multiple steps together so that the output of each step becomes the input to the next step.

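Pipelines make it easy to apply exactly the same preprocessing to the training data and the test data: every step is fitted once on the training data and then reused unchanged at prediction time.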
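As a small illustration of that point, the sketch below fits a two-step pipeline on toy data and then checks which statistics the scaler uses on the test data. The arrays and the name demo_pipe are made up for this example and are not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, for illustration only
X_train = np.array([[0.0], [5.0], [10.0]])
y_train = np.array([0, 0, 1])
X_test = np.array([[20.0]])

demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
demo_pipe.fit(X_train, y_train)

# The scaler keeps the min and max it learned from X_train
scaler = demo_pipe.named_steps["scale"]
print(scaler.data_min_, scaler.data_max_)

# At prediction time the same statistics are reused, so 20 is scaled
# against the training range (0 to 10) and is not refitted to the test data
print(scaler.transform(X_test))
print(demo_pipe.predict(X_test))
```

Because the test value lies outside the training range, it is mapped to a value above 1, which shows that the scaler was not refitted on the test data.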

Without a pipeline, each step is defined separately:

1. Applying Imputation

Imputation means filling in the missing (null) values in the data. Here we use SimpleImputer from sklearn.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

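trf1 = ColumnTransformer([
    # [1] selects the column at index 1 (the second column); the default strategy is the mean
    ("impute", SimpleImputer(), [1]),
    # fill the categorical column at index 6 with its most frequent value
    ("impute_cat", SimpleImputer(strategy="most_frequent"), [6])
], remainder="passthrough")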
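To see what such a transformer produces, here is a minimal sketch on an invented three-column array (the data and the name toy_imputer are assumptions made for illustration). It also shows a detail that matters for the later steps: with remainder="passthrough", the transformed columns come first in the output and the untouched columns are appended after them.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column 0 is complete, columns 1 and 2 have gaps
X = np.array([
    [10.0, 1.0, 0.0],
    [20.0, np.nan, 1.0],
    [30.0, 3.0, 1.0],
    [40.0, 5.0, np.nan],
])

toy_imputer = ColumnTransformer([
    ("impute_mean", SimpleImputer(strategy="mean"), [1]),
    ("impute_mode", SimpleImputer(strategy="most_frequent"), [2]),
], remainder="passthrough")

X_filled = toy_imputer.fit_transform(X)

# Output column order: imputed column 1, imputed column 2, then original column 0
print(X_filled)
```

The gap in column 1 is filled with the mean of 1, 3 and 5, and the gap in column 2 with the most frequent value, 1. Any step that follows has to address columns by their position in this reordered output.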

2. OneHot Encoding

One-hot encoding means converting each categorical column into numeric 0/1 indicator columns, one per category. Here we use OneHotEncoder from sklearn.

from sklearn.preprocessing import OneHotEncoder

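trf2 = ColumnTransformer([
    # These indices are positions in the output of the previous step,
    # not positions in the original data.
    # sparse_output replaced the sparse argument in scikit-learn 1.2.
    ("encoding", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 6])
], remainder='passthrough')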
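The effect of handle_unknown='ignore' is easiest to see on a tiny example. The sketch below uses invented colour labels; the encoder is fitted on the training values only and then applied to a test set that contains a category it has never seen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "blue" appears only in the test set
train = np.array([["red"], ["green"], ["red"]])
test = np.array([["green"], ["blue"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories are learned from the training data and sorted
print(encoder.categories_)

# The default output is a sparse matrix; toarray() makes it dense for printing
encoded = encoder.transform(test).toarray()
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, which is usually what you want when the test data may contain rare values.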

3. Scaling

Scaling here means rescaling every numeric column to the 0 to 1 range. We use MinMaxScaler from sklearn.

from sklearn.preprocessing import MinMaxScaler

trf3 = ColumnTransformer([
    # slice(0, 8) selects columns 0 through 7 (the end index is excluded)
    ('scale', MinMaxScaler(), slice(0, 8))
])
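MinMaxScaler rescales each column with x_scaled = (x - min) / (max - min), where min and max are taken per column from the data it was fitted on. The following sketch checks this on an invented one-column array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column, for illustration only
X = np.array([[10.0], [15.0], [20.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(X)

# The same result computed by hand from the formula
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value maps to 0, the largest to 1, and the midpoint to about 0.5.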

4. Train the model

Define the model. Here we are using DecisionTreeClassifier.

from sklearn.tree import DecisionTreeClassifier

trf4 = DecisionTreeClassifier()
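Defining the steps is only half of the work when no pipeline is used: each transformer also has to be fitted on the training data and applied to the test data by hand, in the right order. A minimal sketch of that bookkeeping, on invented single-column data and with only an imputer and a scaler to keep it short:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, for illustration only
X_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
y_train = np.array([0, 1, 1, 1])
X_test = np.array([[np.nan], [1.5]])

imputer = SimpleImputer()
scaler = MinMaxScaler()
model = DecisionTreeClassifier(random_state=0)

# Training: fit_transform for every preprocessing step, then fit the model
X_train_imp = imputer.fit_transform(X_train)
X_train_scaled = scaler.fit_transform(X_train_imp)
model.fit(X_train_scaled, y_train)

# Prediction: the same steps again, but with transform only
X_test_imp = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imp)
manual_pred = model.predict(X_test_scaled)
print(manual_pred)
```

Every intermediate array has to be tracked manually, and calling fit_transform on the test data by mistake would silently leak test statistics into the preprocessing.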

With the pipeline:

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("trf1", trf1),
    ("trf2", trf2),
    ("trf3", trf3),
    ("trf4", trf4),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
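For a version that can be run from start to finish, here is a sketch of the same four stages on a small invented dataset. The data, the column meanings and the step names are assumptions made for illustration. Two simplifications are made: MinMaxScaler is used directly as a step instead of being wrapped in a ColumnTransformer, and sparse_threshold=0 is set so that the encoding step always returns a dense array, which the scaler needs.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column 0 = age, column 1 = category code, column 2 = fare
X_train = np.array([
    [22.0, 0.0, 7.0],
    [np.nan, 1.0, 70.0],
    [35.0, 1.0, 50.0],
    [28.0, np.nan, 8.0],
    [40.0, 2.0, 80.0],
    [19.0, 1.0, 5.0],
])
y_train = np.array([0, 1, 1, 0, 1, 0])
X_test = np.array([
    [30.0, 1.0, 60.0],
    [20.0, np.nan, 6.0],
])

# Step 1: fill gaps; output order stays age, category, fare
impute = ColumnTransformer([
    ("impute_age", SimpleImputer(strategy="mean"), [0]),
    ("impute_cat", SimpleImputer(strategy="most_frequent"), [1]),
], remainder="passthrough")

# Step 2: one-hot encode the category column (index 1 in the output of step 1)
encode = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), [1]),
], remainder="passthrough", sparse_threshold=0)

toy_pipe = Pipeline([
    ("impute", impute),
    ("encode", encode),
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

toy_pipe.fit(X_train, y_train)
y_pred = toy_pipe.predict(X_test)
print(y_pred)
```

A single fit call trains every step in order, and a single predict call pushes the test rows, including their missing values, through the same fitted steps.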

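Alternatively, use the make_pipeline function from sklearn. It builds the same kind of pipeline but names the steps automatically, so you pass only the transformers and the model: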

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(trf1, trf2, trf3, trf4)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
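The names that make_pipeline generates are the lowercased class names, with a numeric suffix added when the same class appears more than once. A short sketch with stand-in steps chosen only for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# One scaler and one model: names are the lowercased class names
auto_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier())
names = [name for name, _ in auto_pipe.steps]
print(names)

# Two steps of the same class get "-1" and "-2" suffixes
repeated_pipe = make_pipeline(MinMaxScaler(), MinMaxScaler(), DecisionTreeClassifier())
repeated_names = [name for name, _ in repeated_pipe.steps]
print(repeated_names)
```

These generated names are what you use to look up a step through named_steps or to address its parameters later.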

How to see the full pipeline as a diagram (in a Jupyter notebook)

from sklearn import set_config

set_config(display='diagram')

pipe
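The diagram is rendered by the notebook. Outside a notebook, one option is to export the same HTML representation to a file and open it in a browser. The sketch below assumes the estimator_html_repr helper from sklearn.utils; the stand-in pipeline and the output file name are made up for illustration.

```python
import tempfile
from pathlib import Path

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import estimator_html_repr

# A small stand-in pipeline; any pipeline object can be used here
demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier()),
])

# Build the HTML for the diagram and save it (hypothetical file name)
html = estimator_html_repr(demo_pipe)
out_path = Path(tempfile.gettempdir()) / "pipeline_diagram.html"
out_path.write_text(html, encoding="utf-8")
print(out_path)
```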

This is a basic guide to implementing pipelines in sklearn. A pipeline bundles all the preprocessing steps and the model fitting into a single object, so one fit call and one predict call cover the whole workflow.
