Outlier Detection and Removal



A guide to detecting outliers, handling them, and preparing data that fits your use case.

Outliers can have a significant impact on model accuracy, and several methods exist for detecting them. This guide covers the main methods for detecting and removing outliers from data using pandas, with scikit-learn for the KNN-based method.

 

1. Z-Score or Standard Score

  • This method calculates the z-score of each data point, that is, how many standard deviations the point lies from the mean. Data points whose absolute z-score exceeds a chosen threshold (commonly 3) are considered outliers.
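
To make the definition concrete, here is a minimal sketch on a toy sample. The values are hypothetical and chosen only so that a single point stands out; the threshold of 3 follows the convention mentioned above.

```python
import pandas as pd

# Hypothetical toy sample: twenty ordinary readings plus one extreme value.
values = pd.Series([10.0] * 10 + [12.0] * 10 + [50.0])

# z = (x - mean) / std; pandas uses the sample standard deviation (ddof=1).
z_scores = (values - values.mean()) / values.std()

# Flag every point whose absolute z-score exceeds the threshold.
threshold = 3
is_outlier = z_scores.abs() > threshold

print(values[is_outlier])
```

Only the value 50 is flagged here. Note that with a sample standard deviation the largest possible z-score in a sample of n points is (n - 1) / sqrt(n), so a threshold of 3 can never be exceeded when the sample has ten points or fewer.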

 

Computing z-scores with pandas:
import pandas as pd
import numpy as np

np.random.seed(42)
data = pd.DataFrame({
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(0, 1, 1000),
})

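def detect_outliers_zscore(df, threshold=3):
    # Absolute z-score of every cell, computed column by column
    z_scores = ((df - df.mean()) / df.std()).abs()
    # Boolean DataFrame: True where a cell exceeds the threshold
    return z_scores > threshold

outliers = detect_outliers_zscore(data)

# Mask flagged cells as NaN, then drop every row that contains one
cleaned_data = data[~outliers].dropna()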

print("Original Data:")
print(data.head())

print("\nOutliers:")
print(outliers.head())

print("\nCleaned Data:")
print(cleaned_data.head())
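
The sample above is drawn from a standard normal distribution, so only a few points exceed the threshold by chance. The sketch below plants two extreme values (an assumption made purely for illustration) and removes whole rows with `.any(axis=1)`, which gives the same result as masking and calling `dropna()` as long as the original data has no missing values.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = pd.DataFrame({
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(0, 1, 1000),
})

# Assumption for illustration: plant two obvious outliers.
data.iloc[0, 0] = 10
data.iloc[1, 1] = -8

# Absolute z-score per cell, then one flag per row.
z_scores = ((data - data.mean()) / data.std()).abs()
row_has_outlier = (z_scores > 3).any(axis=1)

# Keep only rows in which no column was flagged.
cleaned_rows = data[~row_has_outlier]

print(f"Rows flagged: {row_has_outlier.sum()}")
print(f"Rows kept: {len(cleaned_rows)}")
```

The row-wise mask is often easier to reuse, for example to inspect the flagged rows before deciding whether to drop them.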

 

2. IQR (Interquartile Range) Method

  • The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1) of the data: IQR = Q3 - Q1. Outliers are identified as points outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
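
As a quick worked example of the rule, the sketch below computes the quartiles and both fences for a small hypothetical series. The numbers in the comments assume pandas' default linear interpolation for `quantile`.

```python
import pandas as pd

# Hypothetical sample: nine ordinary values and one extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)    # 3.25
q3 = values.quantile(0.75)    # 7.75
iqr = q3 - q1                 # 4.5

lower_fence = q1 - 1.5 * iqr  # -3.5
upper_fence = q3 + 1.5 * iqr  # 14.5

# Anything outside [lower_fence, upper_fence] counts as an outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())  # [100]
```

Because the fences are built from quartiles rather than the mean and standard deviation, the extreme value barely affects them.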

 

Detecting outliers with the IQR method in pandas:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
data = pd.DataFrame({
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(0, 1, 1000),
})

# Introduce some outliers
data.iloc[0, 0] = 10  # An outlier in Feature1
data.iloc[1, 1] = -8  # An outlier in Feature2

# Standardize the data
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Function to detect outliers using IQR
def detect_outliers_iqr(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))

outliers_iqr = detect_outliers_iqr(scaled_data)

# Mask flagged cells as NaN, then drop every row that contains one
cleaned_data_iqr = data[~outliers_iqr].dropna()

print("Original Data:")
print(data.head())

print("\nOutliers - IQR:")
print(outliers_iqr.head())

print("\nCleaned Data - IQR:")
print(cleaned_data_iqr.head())
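
The standardization step in the listing above is optional for the IQR rule: the quartiles and fences shift and scale together with the data, so the mask should come out the same on raw and standardized values. The sketch below checks this, using a manual standardization as a stand-in for `StandardScaler`.

```python
import numpy as np
import pandas as pd

def detect_outliers_iqr(df):
    # Per-column quartiles and fences, as in the listing above.
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))

np.random.seed(42)
data = pd.DataFrame({
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(0, 1, 1000),
})
data.iloc[0, 0] = 10  # planted outlier in Feature1
data.iloc[1, 1] = -8  # planted outlier in Feature2

# Manual standardization (stand-in for StandardScaler in this sketch).
scaled = (data - data.mean()) / data.std()

mask_raw = detect_outliers_iqr(data)
mask_scaled = detect_outliers_iqr(scaled)

# Compare the two masks cell by cell.
print((mask_raw == mask_scaled).all().all())
```

Scaling still matters for distance-based methods such as KNN, where features on larger scales would otherwise dominate the distance.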

 

3. K-Nearest Neighbors (KNN)

  • KNN-based methods classify a data point as an outlier if its distance to its k-nearest neighbors exceeds a certain threshold.
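
To show the idea without any library beyond NumPy, the sketch below computes each point's distance to its k-th nearest neighbor by brute force. The points, the value of k, and the "twice the median" rule are all assumptions made for this illustration.

```python
import numpy as np

# Hypothetical 2-D points: a tight cluster plus one far-away point.
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [0.5, 0.5],
    [8.0, 8.0],   # isolated point
])

k = 2  # number of neighbors to consider, not counting the point itself

# Pairwise Euclidean distances, shape (n, n).
diffs = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

# Sort each row; column 0 is the distance of a point to itself (0.0),
# so column k holds the distance to the k-th nearest neighbor.
kth_distance = np.sort(pairwise, axis=1)[:, k]

# Assumed rule for this sketch: flag points whose k-th neighbor distance
# is more than twice the median k-th neighbor distance.
threshold = 2.0 * np.median(kth_distance)
is_outlier = kth_distance > threshold

print(np.flatnonzero(is_outlier))  # [5]
```

The brute-force distance matrix grows quadratically with the number of points, which is why a dedicated neighbor search is preferable on larger data sets.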

 

A KNN-based implementation for detecting outliers, using scikit-learn:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Generate sample data
np.random.seed(42)
data = pd.DataFrame({
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(0, 1, 1000),
})

# Introduce some outliers
data.iloc[0, 0] = 10  # An outlier in Feature1
data.iloc[1, 1] = -8  # An outlier in Feature2

# Standardize the data
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Function to detect outliers using KNN
def detect_outliers_knn(df, k=5, threshold_distance=2.0):
    # Request k + 1 neighbors: each point comes back as its own nearest
    # neighbor (distance 0) because we query with the training data
    knn_model = NearestNeighbors(n_neighbors=k + 1)
    knn_model.fit(df)

    # Distances to the nearest neighbors of each point, sorted ascending
    distances, _ = knn_model.kneighbors(df)

    # Median neighbor distance over the whole data set (self-distance excluded)
    median_distance = np.median(distances[:, 1:])

    # Flag points whose k-th neighbor is farther away than
    # threshold_distance times the median neighbor distance
    outlier_mask = (distances[:, -1] > threshold_distance * median_distance)

    return outlier_mask

# Detect outliers using KNN
outliers_knn = detect_outliers_knn(scaled_data)

# Remove outliers from the data
cleaned_data_knn = data[~outliers_knn].dropna()

# Display the original data and highlighted outliers
print("Original Data:")
print(data.head())

print("\nOutliers - KNN:")
print(outliers_knn[:5])

print("\nCleaned Data - KNN:")
print(cleaned_data_knn.head())

 

4. Percentile-based Method

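  • Remove data points above a certain percentile (e.g., the 99th percentile) of each feature.
# Reuses the `data` DataFrame from the examples above
upper_limit = data.quantile(0.99)  # per-column 99th percentile
outliers = data > upper_limit
# Mask flagged cells as NaN, then drop every row that contains one
cleaned_data = data[~outliers].dropna()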
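
The snippet above trims only the upper tail. A two-sided variant is sketched below as a self-contained example; the 1st and 99th percentile cut-offs are assumptions for this illustration and should be tuned to the data.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = pd.DataFrame({
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(0, 1, 1000),
})

# Assumed cut-offs for this sketch: the 1st and 99th percentiles per column.
lower_limit = data.quantile(0.01)
upper_limit = data.quantile(0.99)

# A row is kept only if every column lies inside its own [lower, upper] band.
within_limits = ((data >= lower_limit) & (data <= upper_limit)).all(axis=1)
trimmed_data = data[within_limits]

print(f"Rows before: {len(data)}, rows after: {len(trimmed_data)}")
```

Keep in mind that a percentile cut-off always removes a fixed share of the data, whether or not those points are genuinely anomalous.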

 


