Exploratory Data Analysis - EDA


A practical guide to EDA in Python using the Matplotlib library.

 

What is EDA?

  • In simple terms, EDA means exploring your data visually to get an idea of its trends and patterns.

EDA is an increasingly important step in data science work, and many types of plots are available for it.

 

What is Matplotlib?

  • Matplotlib is a Python library for plotting many types of graphs from data, and its basic usage is easy to learn.

 

Useful plot types in Matplotlib
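
The snippets in this guide refer to a pandas DataFrame named data with placeholder columns such as column_name, column1, and time_column. To try them out, a stand-in dataset is needed. The following is a minimal sketch that builds one from random numbers; the column names match the placeholders used in the snippets, but every value, distribution, and date here is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the synthetic data is reproducible.
rng = np.random.default_rng(seed=0)
n = 200

# Hypothetical stand-in for a real dataset; all values are synthetic.
data = pd.DataFrame({
    "column_name": rng.normal(loc=50, scale=10, size=n),    # numeric column
    "column1": rng.uniform(0, 100, size=n),                 # numeric column
    "category_column": rng.choice(["A", "B", "C"], size=n), # categorical column
    "time_column": pd.date_range("2024-01-01", periods=n, freq="D"),
})
# column2 depends on column1 plus noise, so the two are related.
data["column2"] = 2 * data["column1"] + rng.normal(0, 15, size=n)
# value_column is a random walk, which gives a time-series-like shape.
data["value_column"] = np.cumsum(rng.normal(0, 1, size=n))

print(data.head())
print(data.dtypes)
```

With real data, this step would usually be replaced by loading a file, for example with pd.read_csv.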

 

Histograms

Visualize the distribution of a single variable.

Helps identify patterns, skewness, and outliers.

import matplotlib.pyplot as plt

# Plotting a histogram
plt.hist(data['column_name'], bins=30, color='blue', edgecolor='black')
plt.title('Histogram Title')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.show()
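
Skewness seen in a histogram can also be backed by a number. The following is a minimal sketch, assuming a synthetic right-skewed sample in place of a real column; it uses the pandas Series.skew method and compares the mean with the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
# Synthetic right-skewed sample (log-normal); illustrative data only.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="column_name")

# Sample skewness: a positive value suggests a longer right tail,
# a negative value a longer left tail, and a value near 0 rough symmetry.
skewness = values.skew()

# In a right-skewed sample the mean is typically pulled above the median.
mean, median = values.mean(), values.median()

print(f"skewness={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
```

A clearly positive or negative skewness is a hint that a log scale or a transformation may make the histogram easier to read.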

 

Scatter Plots

A scatter plot shows the relationship between two numeric fields by plotting one against the other.

It can reveal clusters and patterns in the data.

import matplotlib.pyplot as plt

# Plotting a scatter plot
plt.scatter(data['column1'], data['column2'], color='red', marker='o')
plt.title('Scatter Plot between Column1 and Column2')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.show()
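
Clusters are easier to see when each group gets its own color. The following is a rough sketch, assuming the data also has a categorical column (here the hypothetical name group) and using two synthetic clusters; it draws one scatter call per group so that Matplotlib can build a legend.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
# Two synthetic groups centred at different points (illustrative data only).
data = pd.DataFrame({
    "column1": np.concatenate([rng.normal(2, 0.5, 50), rng.normal(6, 0.5, 50)]),
    "column2": np.concatenate([rng.normal(3, 0.5, 50), rng.normal(8, 0.5, 50)]),
    "group": ["A"] * 50 + ["B"] * 50,  # hypothetical grouping column
})

fig, ax = plt.subplots()
# One scatter call per group, so each group gets its own color and legend entry.
for name, subset in data.groupby("group"):
    ax.scatter(subset["column1"], subset["column2"], label=name, alpha=0.7)

ax.set_title("Scatter Plot of Column1 vs Column2 by Group")
ax.set_xlabel("column1")
ax.set_ylabel("column2")
ax.legend(title="group")
```

Calling plt.show() afterwards displays the figure in an interactive session.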

 

Box Plots

Mainly used for detecting outliers in the data. A box plot also summarizes the median and quartiles of a variable.

import matplotlib.pyplot as plt

# Plotting a box plot
plt.boxplot(data['column_name'])
plt.title('Box Plot of Column Name')
plt.show()
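
By default, Matplotlib's boxplot extends the whiskers to 1.5 times the interquartile range (the whis parameter) and draws points beyond them as individual markers. The same rule can be applied directly to list the outlying values. The following is a small sketch using a made-up sample.

```python
import pandas as pd

# Made-up sample with one obviously extreme value.
values = pd.Series([10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 45], name="column_name")

# Quartiles and interquartile range (IQR).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR fences, matching boxplot's default whis=1.5.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points a box plot would draw individually.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

Whether such points are errors or genuine extreme observations still has to be judged from the context of the data.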

 

Pair Plots

Pair plots show the pairwise relationships between columns, making it easy to spot columns that are similar or related to each other.

They are also useful for feature selection in machine learning.

Matplotlib has no built-in pair plot, so we use the seaborn library for it.

import matplotlib.pyplot as plt
import seaborn as sns

# Using seaborn for pair plot
sns.pairplot(data)
plt.show()

 

Correlation Heatmap

Identify strong linear relationships between numeric variables.

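import matplotlib.pyplot as plt
import seaborn as sns

# Creating a correlation matrix from the numeric columns only
# (numeric_only requires pandas 1.5 or later)
correlation_matrix = data.corr(numeric_only=True)

# Plotting a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()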
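
With many columns, a heatmap gets crowded, and it can help to list the strongest pairs as a table. The following is a minimal sketch on synthetic data; the column names a, b, c, and label and the 0.8 threshold are assumptions chosen for illustration, not a recommended cutoff.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
x = rng.normal(size=300)

# Synthetic data: b is almost a linear function of a, c is independent noise.
data = pd.DataFrame({
    "a": x,
    "b": 3 * x + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
    "label": ["u", "v", "w"] * 100,  # non-numeric column, ignored by corr
})

corr = data.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical threshold for calling a relationship "strong".
threshold = 0.8
strong = pairs[pairs.abs() >= threshold].sort_values(key=np.abs, ascending=False)

print(strong)
```

Correlation only measures linear association, so a low value does not rule out a nonlinear relationship.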

 

Bar Charts

Display the distribution of categorical variables.

Compare the frequencies of different categories.

import matplotlib.pyplot as plt

# Plotting a bar chart
data['category_column'].value_counts().plot(kind='bar', color='green')
plt.title('Bar Chart of Category Column')
plt.xlabel('Categories')
plt.ylabel('Frequency')
plt.show()
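
Raw frequencies are hard to compare across datasets of different sizes, so proportions are sometimes more useful. The following is a short sketch with a made-up categorical column; it uses the normalize option of value_counts to get shares instead of counts.

```python
import pandas as pd

# Made-up categorical data for illustration.
data = pd.DataFrame({"category_column": ["A", "B", "A", "C", "A", "B"]})

# Absolute frequencies, sorted from most to least common.
counts = data["category_column"].value_counts()

# Relative frequencies (shares that sum to 1).
shares = data["category_column"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Either result can be passed to .plot(kind='bar') in the same way as in the snippet above.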

 

Line Plots

Best for time series analysis, where values are ordered along a time axis. Line plots also work for any other data with a natural order.

import matplotlib.pyplot as plt

# Plotting a line chart
plt.plot(data['time_column'], data['value_column'], marker='o', linestyle='-', color='blue')
plt.title('Line Chart of Time Column and Value Column')
plt.xlabel('Time')
plt.ylabel('Values')
plt.show()
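
A line plot connects points in the order they appear, so a time column stored as text or in the wrong order produces a misleading line. The following is a minimal sketch, assuming a small made-up dataset; it parses the dates, sorts by time, and adds a rolling mean (the window of 3 is an arbitrary choice) that can be plotted alongside the raw values to show the trend.

```python
import pandas as pd

# Made-up, unordered data with dates stored as strings.
data = pd.DataFrame({
    "time_column": ["2024-01-03", "2024-01-01", "2024-01-02", "2024-01-05", "2024-01-04"],
    "value_column": [30.0, 10.0, 20.0, 50.0, 40.0],
})

# Convert the strings to real timestamps so the axis is spaced correctly.
data["time_column"] = pd.to_datetime(data["time_column"])

# Sort by time so the line is drawn in chronological order.
data = data.sort_values("time_column").reset_index(drop=True)

# Rolling mean over 3 observations; the first two rows have no full window.
data["rolling_mean"] = data["value_column"].rolling(window=3).mean()

print(data)
```

The rolling_mean column can then be passed to plt.plot together with time_column, in the same way as value_column.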

 





