Feature Selection In Machine Learning
Feature selection matters when scaling an ML model, and it can also improve the model's accuracy score.
What is feature selection?
Feature selection in machine learning refers to the process of selecting a subset of relevant and significant features (variables or attributes) from a larger set of features to build a model.
Importance of feature selection
Feature selection identifies which features contribute most to predicting the target, so the machine learning model can make better predictions. It also reduces the number of columns in the data, which speeds up model training.
Methods
- Filter Method
- Wrapper Method
- Embedded Method
Filter Method
- The filter method evaluates each feature's importance independently of any machine learning model. It uses statistical measures to score and rank the features based on their characteristics. The highest-scoring features are then selected for inclusion in the final feature set.
Implementation of the filter method using sklearn
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load the iris dataset for demonstration purposes
iris = load_iris()
X, y = iris.data, iris.target
# Create an instance of SelectKBest with the chi-squared test
# (chi2 requires non-negative feature values, which holds for iris)
# k is the number of top features you want to select
k_best = SelectKBest(score_func=chi2, k=2)
# Fit and transform the data to get the selected features
X_new = k_best.fit_transform(X, y)
# Print the feature counts before and after selection, and the selected column indices
print("Original features:", X.shape[1])
print("Selected features:", X_new.shape[1])
print("Selected feature indices:", k_best.get_support(indices=True))
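To see why those features were picked, it can help to look at the scores themselves. The following is a minimal sketch, assuming the same iris data and chi-squared scoring as in the example above; the names `ranked` and `selected_names` are illustrative and not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load the iris dataset and fit the same selector as before
iris = load_iris()
X, y = iris.data, iris.target
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Pair each feature name with its chi-squared score and sort, highest first
ranked = sorted(
    zip(iris.feature_names, selector.scores_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")

# Map the selected column indices back to feature names
selected_names = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected features:", selected_names)
```

Reading the ranked scores next to the selected names makes it easier to judge whether the chosen value of k cuts the list at a sensible point.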
Wrapper Method
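- The wrapper method chooses features based on how a specific machine learning algorithm performs with them. Unlike the filter method, features are not scored independently; candidate feature subsets are evaluated by training the model on them.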
- The process is iterative and computationally more expensive compared to filter methods.
- Most commonly used techniques for the wrapper method:
  - Forward selection
  - Backward elimination
  - Bi-directional elimination (Stepwise Selection)
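Example code for the wrapper method in sklearn, using recursive feature elimination (RFE), which repeatedly fits the model and removes the weakest feature: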
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset for demonstration
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an SVM classifier
svm_classifier = SVC(kernel="linear")
# Create an RFE selector using the SVM classifier
# Select the top 2 features (you can adjust this based on your requirements)
rfe_selector = RFE(estimator=svm_classifier, n_features_to_select=2, step=1)
# Fit RFE on the training data
rfe_selector.fit(X_train, y_train)
# Get the boolean mask marking which features were selected
selected_features = rfe_selector.support_
print("Selected feature mask:", selected_features)
# Transform the training and testing data using the selected features
X_train_selected = rfe_selector.transform(X_train)
X_test_selected = rfe_selector.transform(X_test)
# Train a model (SVM in this case) on the selected features
svm_classifier.fit(X_train_selected, y_train)
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test_selected)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the test set:", accuracy)
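RFE ranks features using the fitted model's coefficients. Forward selection and backward elimination can instead be sketched with scikit-learn's `SequentialFeatureSelector` (available from scikit-learn 0.24), which adds or removes one feature at a time based on a cross-validated score. This is only a sketch under assumptions: the linear SVM, the target of two features, and 5-fold cross-validation are choices made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Load the iris dataset for demonstration
iris = load_iris()
X, y = iris.data, iris.target

# The estimator whose cross-validated score drives the search (assumed choice)
estimator = SVC(kernel="linear")

# Forward selection: start with no features and add the best one at each step
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
)
forward.fit(X, y)

# Backward elimination: start with all features and drop the weakest at each step
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward", cv=5
)
backward.fit(X, y)

print("Forward selection indices:", forward.get_support(indices=True))
print("Backward elimination indices:", backward.get_support(indices=True))

# Reduce the data to the features chosen by forward selection
X_forward = forward.transform(X)
print("Shape after forward selection:", X_forward.shape)
```

The two directions do not have to agree on the same subset, and each step refits the model several times, which is where the extra cost of wrapper methods comes from.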
Embedded Method
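- Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have feature selection built into training.
- These methods optimize both the model’s performance and the relevance of features simultaneously.
- Lasso (L1) regression is the most common example, because it can shrink coefficients to exactly zero and so drop features. Ridge (L2) regression is often mentioned alongside it, but it only shrinks coefficients and does not remove features on its own.
Example code for implementing the embedded method using Logistic Regression with L1 regularization.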
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the iris dataset for demonstration
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
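# Create a Logistic Regression model with L1 regularization
# The 'C' parameter controls the strength of regularization; smaller values lead to stronger regularization
# 'saga' supports the L1 penalty on multiclass targets; 'liblinear' handles multiclass only
# through one-vs-rest, which newer scikit-learn releases deprecate
logreg_model = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=5000)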
# Fit the model on the training data
logreg_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = logreg_model.predict(X_test)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the test set:", accuracy)
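# Extract the coefficients (one row per class); with L1 regularization,
# a coefficient of exactly zero means the feature is not used for that class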
feature_importance = logreg_model.coef_
# Print the feature importance
print("Feature Importance (Coefficients):", feature_importance)
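The printed coefficients show which features the model relies on, but they do not reduce the data by themselves. One way to turn an L1-regularized model into an actual selection step is scikit-learn's `SelectFromModel`. The sketch below rests on assumptions made here for illustration: the `saga` solver, standardized features, `C=0.1`, and `max_iter=5000` are not taken from the original example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the iris dataset and standardize it so the solver converges more easily
iris = load_iris()
X, y = iris.data, iris.target
X_scaled = StandardScaler().fit_transform(X)

# L1-regularized logistic regression (assumed settings, see the note above)
l1_model = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)

# SelectFromModel fits the model and keeps the features whose
# coefficients are not shrunk to (near) zero
selector = SelectFromModel(l1_model)
X_selected = selector.fit_transform(X_scaled, y)

kept = selector.get_support(indices=True)
print("Kept feature indices:", kept)
print("Kept feature names:", [iris.feature_names[i] for i in kept])
print("Shape before:", X_scaled.shape, "after:", X_selected.shape)
```

Lowering `C` strengthens the regularization and usually leaves fewer features, so it acts as the knob for how aggressive the selection is.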