Exploratory Data Analysis (EDA) in Machine Learning: Unveiling Insights Beyond the Surface
In the realm of machine learning, understanding and preparing your data is a cornerstone of success. Exploratory Data Analysis (EDA) serves as the compass guiding you through the intricate landscape of your dataset. In this blog, we will take a deep look at EDA, from its fundamental principles to its types and its impact on model performance. If you are new to machine learning and EDA but curious to learn and explore it, like me, then this blog is for you.
Why is Exploratory Data Analysis (EDA) Essential?
Exploratory Data Analysis is the art of dissecting, visualizing, and comprehending the structure and nuances of your dataset. It provides a solid foundation for the subsequent stages of the machine-learning pipeline and is instrumental in ensuring data quality.
EDA helps you in the process of understanding, cleaning, and preparing your data. Its importance lies in:
Data Understanding: EDA provides a comprehensive view of your dataset, helping you comprehend its attributes, patterns, and anomalies.
Data Quality Enhancement: It assists in identifying and handling challenges such as missing values, outliers, and inconsistent data.
Feature Insights: EDA guides the selection and creation of relevant features, leading to more accurate and impactful models.
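Before any of these steps, a quick structural summary is usually the very first look at a dataset. A minimal sketch using pandas (the small DataFrame here is made up for illustration):

```python
import pandas as pd

# A small, made-up dataset for illustration
df = pd.DataFrame({'Age': [25, 30, None, 28, 22],
                   'Income': [50000, 60000, None, 55000, 80000],
                   'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Mumbai']})

df.info()                 # dtypes and non-null counts per column
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column
```

These three calls alone reveal column types, value ranges, and where the gaps are, which shapes everything that follows.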
Example: EDA for Handling Missing Values
Missing values can hinder model performance. EDA helps you identify these gaps in the data; once located, they can either be filled (imputed) or dropped to improve data quality.
import pandas as pd
data = {'Age': [25, 30, None, 28, 22, 40, None, 32, 29, 35],
        'Income': [50000, 60000, None, 55000, None, 80000, 95000, 72000, 65000, 70000]}
df = pd.DataFrame(data)
# EDA: Visualizing Missing Values
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Value Heatmap')
plt.show()
# EDA: Impute Missing Values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].mean())
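As an alternative to filling columns by hand, scikit-learn's SimpleImputer does the same imputation in a pipeline-friendly way. A minimal sketch on a smaller made-up sample:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Age': [25, 30, None, 28, 22],
                   'Income': [50000, 60000, None, 55000, 80000]})

# Median is robust to outliers; mean suits roughly symmetric columns
df[['Age']] = SimpleImputer(strategy='median').fit_transform(df[['Age']])
df[['Income']] = SimpleImputer(strategy='mean').fit_transform(df[['Income']])
print(df.isnull().sum().sum())  # 0 - no missing values remain
```

The advantage of the imputer object is that the statistic is learned from training data and can then be reapplied to new data with `transform`.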
Example: EDA for Feature Encoding
Categorical variables often need encoding for model consumption. EDA guides the selection and application of suitable encoding techniques.
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
        'Education': ['Graduate', 'Undergraduate', 'Graduate', 'Graduate', 'Postgraduate', 'Undergraduate']}
df = pd.DataFrame(data)
# EDA: Visualizing Categorical Features
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Gender', data=df)
plt.title('Distribution of Gender')
plt.show()
sns.countplot(x='Education', data=df)
plt.title('Distribution of Education')
plt.xticks(rotation=45)
plt.show()
# EDA: Applying One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['Gender', 'Education'])
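A quick sanity check of what one-hot encoding produces, sketched on a small made-up version of the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                   'Education': ['Graduate', 'Undergraduate', 'Postgraduate']})

df_encoded = pd.get_dummies(df, columns=['Gender', 'Education'])
# Each category becomes its own 0/1 column, e.g. 'Gender_Female', 'Gender_Male'
print(list(df_encoded.columns))
```

Passing `drop_first=True` to `pd.get_dummies` drops one column per feature, which helps avoid multicollinearity in linear models.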
Types of Exploratory Data Analysis
Univariate Analysis
Univariate analysis examines individual features in isolation. It provides insights into the distribution and central tendencies of each variable.
Example - Histogram for Feature Distribution: Let's visualize the distribution of ages in a dataset using a histogram.
import matplotlib.pyplot as plt
import pandas as pd
data = {'Age': [25, 30, 32, 28, 22, 40, 55, 32, 29, 35]}
df = pd.DataFrame(data)
plt.hist(df['Age'], bins=5, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
Bivariate Analysis
Bivariate analysis explores relationships between pairs of features. It reveals correlations and dependencies.
Example - Scatter Plot for Correlation: Let's examine the correlation between hours studied and exam scores.
import matplotlib.pyplot as plt
import pandas as pd
data = {'Hours_Studied': [2, 4, 6, 8, 10],
        'Exam_Score': [50, 70, 75, 90, 95]}
df = pd.DataFrame(data)
plt.scatter(df['Hours_Studied'], df['Exam_Score'])
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Correlation between Hours Studied and Exam Score')
plt.show()
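The visual impression from the scatter plot can be quantified with the Pearson correlation coefficient, sketched here on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'Hours_Studied': [2, 4, 6, 8, 10],
                   'Exam_Score': [50, 70, 75, 90, 95]})

r = df['Hours_Studied'].corr(df['Exam_Score'])
print(f'Pearson r = {r:.3f}')  # close to 1: strong positive linear relationship
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.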
Multivariate Analysis
Multivariate analysis involves studying interactions among multiple features. It uncovers complex relationships and patterns.
Example - Pair Plot: Using a pair plot, we can visualize pairwise relationships across multiple features.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [2, 3, 5, 4, 6],
        'Feature3': [5, 6, 8, 7, 9]}
df = pd.DataFrame(data)
sns.pairplot(df)
plt.show()
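Alongside the pair plot, a correlation heatmap condenses all pairwise relationships into a single view. A sketch on the same toy data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5],
                   'Feature2': [2, 3, 5, 4, 6],
                   'Feature3': [5, 6, 8, 7, 9]})

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
```

Strongly correlated feature pairs stand out immediately, which is useful both for feature selection and for spotting redundancy.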
Synergy of EDA and Clustering
Clustering is an unsupervised learning technique that groups similar data points. EDA helps identify features that drive clustering patterns.
Example - K-Means Clustering with EDA Insights
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
data = {'Feature1': [1, 2, 3, 10, 11, 12, 20, 21, 22],
        'Feature2': [2, 3, 5, 15, 16, 17, 25, 26, 27]}
df = pd.DataFrame(data)
# EDA: Visualize Data Distribution
plt.scatter(df['Feature1'], df['Feature2'])
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('Data Distribution')
plt.show()
# EDA-Informed K-Means Clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # the scatter plot shows three groups
df['Cluster'] = kmeans.fit_predict(df[['Feature1', 'Feature2']])
# EDA: Visualize Clusters
plt.scatter(df['Feature1'], df['Feature2'], c=df['Cluster'], cmap='rainbow')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('K-Means Clustering')
plt.show()
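Besides eyeballing the scatter plot, the elbow method is a common way to let EDA suggest the number of clusters: plot the inertia (within-cluster sum of squares) against k and look for the bend. A sketch on the same data, with a fixed random seed for reproducibility:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.DataFrame({'Feature1': [1, 2, 3, 10, 11, 12, 20, 21, 22],
                   'Feature2': [2, 3, 5, 15, 16, 17, 25, 26, 27]})

inertias = []
ks = range(1, 6)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```

Inertia always decreases as k grows; the point where the decrease levels off marks a reasonable choice of k.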
EDA Helps in Identifying Outliers
An outlier is a data point that deviates significantly from the rest of the (so-called normal) observations. Outliers can be caused by measurement or execution errors, and the analysis for detecting them is referred to as outlier mining. There are many ways to detect outliers, and removing them from a data frame works the same way as removing any other row from a pandas DataFrame.
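One of the simplest detection methods is the z-score: points lying several standard deviations from the mean are flagged. A sketch on made-up data:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

z = (values - values.mean()) / values.std()  # standardized distances from the mean
outliers = values[np.abs(z) > 2]  # a threshold of 2 to 3 is a common rule of thumb
print(outliers)  # [95]
```

The boxplot and IQR approaches below are alternatives that do not assume the data is roughly normal.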
Handling Outliers
Example:
# Importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

# Points beyond the boxplot whiskers are potential outliers
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()
Removing Outliers
To remove an outlier, you delete its row from the dataset by position, because every detection method above ultimately produces the list of data points that satisfy that method's definition of an outlier.
# Importing packages
import numpy as np
import pandas as pd
import seaborn as sns

# Load the dataset
df = pd.read_csv('Iris.csv')

# IQR (interquartile range)
Q1 = df['SepalWidthCm'].quantile(0.25)
Q3 = df['SepalWidthCm'].quantile(0.75)
IQR = Q3 - Q1
print("Old Shape: ", df.shape)

# Upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))
# Lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# Removing the outliers (assumes the default integer index)
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)
print("New Shape: ", df.shape)
sns.boxplot(x='SepalWidthCm', data=df)
Impact of EDA on Model Performance
Effective EDA directly impacts model performance and generalization. It enhances your ability to:
Make informed decisions about data preprocessing and feature engineering.
Optimize hyperparameters based on data insights.
Address challenges such as class imbalance or multicollinearity.
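For instance, class imbalance and multicollinearity can both be spotted with one-liners during EDA. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'feature_a': [1, 2, 3, 4, 5, 6],
                   'feature_b': [2, 4, 6, 8, 10, 12],  # perfectly collinear with feature_a
                   'target':    [0, 0, 0, 0, 0, 1]})   # heavily imbalanced

# Class imbalance: relative class frequencies
print(df['target'].value_counts(normalize=True))

# Multicollinearity: near-perfect correlations between features
print(df[['feature_a', 'feature_b']].corr())
```

A heavily skewed class distribution argues for resampling or class weights, while a near-1.0 feature correlation suggests dropping or combining one of the pair.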
Example of EDA used in problem-solving
Heart Stroke Detection
To solve this problem, several of the important EDA techniques are applied, demonstrating their impact and the insights they can provide into your data. You can check it out in the links given below, and if you'd like to try it yourself, there's a link to the dataset too. Have fun!
Source Code (IPYNB file): Click here
Dataset: Click here
Conclusion
Exploratory Data Analysis is your ally in the quest for understanding and harnessing the potential of your data. From handling missing values to encoding categorical features, EDA equips you with insights and techniques that lay the groundwork for robust model building. Its synergy with clustering amplifies its impact, enabling you to unravel hidden patterns and relationships. By embracing EDA, you unlock a world of data-driven insights that drive the success of your machine-learning endeavors. Happy exploring!