Machine Learning Pipeline
A Pipeline chains multiple steps together so that the output of each step becomes the input to the next step.
Pipelines make it easy to apply exactly the same preprocessing to both the training data and the test data.
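To make the chaining idea concrete, here is a minimal pure-Python sketch of what a pipeline does internally. The classes AddOne, Center and ToyPipeline are hypothetical names made up for this illustration; they are not part of sklearn, and the real Pipeline does considerably more.

```python
class AddOne:
    """Toy step (hypothetical): learns nothing, adds 1 to every value."""
    def fit(self, X):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class Center:
    """Toy step (hypothetical): learns the mean in fit, subtracts it in transform."""
    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ToyPipeline:
    """Feeds the output of each step into the next one."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        for step in self.steps:
            # Each step is fitted on the output of the previous step.
            X = step.fit(X).transform(X)
        return self

    def transform(self, X):
        for step in self.steps:
            # No refitting here: the statistics learned in fit are reused.
            X = step.transform(X)
        return X


toy = ToyPipeline([AddOne(), Center()])
toy.fit([1, 2, 3])                     # Center learns the mean of [2, 3, 4]
train_out = toy.transform([1, 2, 3])   # [-1.0, 0.0, 1.0]
test_out = toy.transform([10])         # [8.0], using the mean learned on train
print(train_out, test_out)
```

The point of the sketch is the last line: the test value is transformed with the mean learned from the training data, which is what "same preprocessing for train and test" means in practice.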
Without a pipeline, each step is defined separately:
1. Applying Imputation
- Imputation means filling in the missing (null) values in the data, here using SimpleImputer from sklearn.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
trf1 = ColumnTransformer([
("impute", SimpleImputer(), [1]),  # index 1 is the second column (indices start at 0); default strategy is "mean"
("impute_cat", SimpleImputer(strategy="most_frequent"), [6])
], remainder="passthrough")
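One detail worth checking on your own data: ColumnTransformer places the transformed columns first and appends the passthrough columns after them, so any column indices used by a later step refer to positions in this reordered output, not in the original data. A small sketch on made-up numbers (the array X below is a hypothetical example, not the dataset from this post):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data: 3 rows, 3 numeric columns, one missing value in column 1.
X = np.array([
    [10.0, 1.0, 100.0],
    [20.0, np.nan, 200.0],
    [30.0, 3.0, 300.0],
])

ct = ColumnTransformer(
    [("impute", SimpleImputer(), [1])],  # default strategy is "mean"
    remainder="passthrough",
)
out = ct.fit_transform(X)

# The imputed column (original index 1) now comes first;
# the missing value was replaced by the mean of 1.0 and 3.0.
print(out)
```

Here the imputed column moves to position 0 and the untouched columns follow in their original order.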
2. OneHot Encoding
- OneHot encoding means converting the categorical columns to numeric columns, one 0/1 column per category, using OneHotEncoder from sklearn.
from sklearn.preprocessing import OneHotEncoder
trf2 = ColumnTransformer([
("encoding", OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 6])  # sparse_output was called sparse before scikit-learn 1.2
], remainder='passthrough')
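As a quick illustration of what the encoder produces, and of what handle_unknown='ignore' does with a category that was not seen during fitting, here is a small sketch. The color values are made up for the example, and it calls .toarray() on the default sparse output so that it does not depend on the name of the sparse flag.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column.
X_train = np.array([["red"], ["green"], ["red"]])
X_test = np.array([["green"], ["blue"]])  # "blue" never appears in training

enc = OneHotEncoder(handle_unknown="ignore")
train_encoded = enc.fit_transform(X_train).toarray()
test_encoded = enc.transform(X_test).toarray()

# Categories are sorted, so the output columns are [green, red].
categories = enc.categories_[0].tolist()
print(categories)
print(train_encoded)
print(test_encoded)  # the unseen "blue" row is encoded as all zeros
```

With handle_unknown='ignore' the unseen category does not raise an error at prediction time; it simply becomes a row of zeros.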
3. Scaling
- Scaling means rescaling each numeric column to the 0 to 1 range, here using MinMaxScaler from sklearn.
from sklearn.preprocessing import MinMaxScaler
trf3 = ColumnTransformer([
('scale', MinMaxScaler(), slice(0, 8))  # slice(0, 8) selects columns 0 through 7 (the end index is excluded)
])
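The sketch below shows the slice behaviour on a made-up 3-column array (hypothetical data, smaller than the 8 columns used above). Note that this transformer has no remainder argument, so the default remainder='drop' applies and any column outside the slice is dropped from the output.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with 3 columns.
X = np.array([
    [0.0, 10.0, 5.0],
    [5.0, 20.0, 5.0],
    [10.0, 30.0, 5.0],
])

# slice(0, 2) selects columns 0 and 1; column 2 is dropped (remainder="drop").
ct = ColumnTransformer([("scale", MinMaxScaler(), slice(0, 2))])
out = ct.fit_transform(X)

print(out.shape)  # two columns remain
print(out)        # each remaining column now runs from 0 to 1
```

If columns outside the slice should be kept, remainder='passthrough' can be passed as in the earlier steps.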
4. Train the model
- Define the model; here we use DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier
trf4 = DecisionTreeClassifier()
With the pipeline:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
("trf1", trf1),
("trf2", trf2),
("trf3", trf3),
("trf4", trf4),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
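For a version that can be run as is, here is a reduced sketch on made-up numeric data. It is not the pipeline from this post: the data, the step names and the choice of plain transformers instead of ColumnTransformer are assumptions made to keep the example short. It shows that fit learns the preprocessing statistics on the training data and predict reuses them on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature with missing values.
X_train = np.array([[1.0], [np.nan], [3.0], [10.0]])
y_train = np.array([0, 0, 0, 1])
X_test = np.array([[np.nan], [9.0]])

demo_pipe = Pipeline([
    ("impute", SimpleImputer()),   # mean imputation by default
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

demo_pipe.fit(X_train, y_train)     # fits every step on the training data only
y_pred = demo_pipe.predict(X_test)  # reuses the fitted imputer and scaler

# The mean used to fill the missing test value was learned from the training data.
learned_mean = demo_pipe.named_steps["impute"].statistics_[0]
print(y_pred, learned_mean)
```

The missing test value is filled with the training mean (14/3), not with anything computed from the test set, which is the main safeguard a pipeline gives you against leaking test information into preprocessing.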
Alternatively, use the make_pipeline function in sklearn, which builds the same kind of pipeline but names the steps automatically:
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(trf1, trf2, trf3, trf4)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
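The automatically generated step names are the lowercased class names, with a numeric suffix added when the same class appears more than once. A short sketch using plain estimators (chosen here only for illustration) shows this:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Each step gets its lowercased class name.
auto = make_pipeline(SimpleImputer(), MinMaxScaler(), DecisionTreeClassifier())
step_names = [name for name, _ in auto.steps]
print(step_names)

# When a class is repeated, "-1", "-2", ... are appended to keep names unique.
repeated = make_pipeline(MinMaxScaler(), MinMaxScaler())
repeated_names = [name for name, _ in repeated.steps]
print(repeated_names)
```

These names matter when you access a step through named_steps or set parameters with the step__parameter syntax, so explicit names via Pipeline can be easier to read in larger pipelines.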
How to view the full pipeline as a diagram (in a notebook):
from sklearn import set_config
set_config(display='diagram')
pipe
This is a basic guide to implementing pipelines in sklearn. A pipeline bundles all the preprocessing steps and the model fitting into a single object, so one call to fit and one call to predict run the whole workflow.
Thanks for feedback.