Exploratory Data Analysis (EDA) in Machine Learning: Unveiling Insights Beyond the Surface


In the realm of machine learning, understanding and preparing your data is a cornerstone of success. Exploratory Data Analysis (EDA) serves as the compass guiding you through the intricate landscape of your dataset. In this blog, we will take a close look at EDA, from its fundamental principles to its main types and its impact on model performance. If you are new to machine learning and EDA but curious to learn and explore it, like me, then this blog is for you.

Why is Exploratory Data Analysis (EDA) Essential?

Exploratory Data Analysis is the art of dissecting, visualizing, and comprehending the structure and nuances of your dataset. It provides a solid foundation for the subsequent stages of the machine-learning pipeline and is instrumental in ensuring data quality.

EDA helps you understand, clean, and prepare your data. Its importance lies in:

  1. Data Understanding: EDA provides a comprehensive view of your dataset, helping you comprehend its attributes, patterns, and anomalies.

  2. Data Quality Enhancement: It assists in identifying and handling challenges such as missing values, outliers, and inconsistent data.

  3. Feature Insights: EDA guides the selection and creation of relevant features, leading to more accurate and impactful models.

Example: EDA for Handling Missing Values

Missing values can hinder model performance. EDA helps identify and address these gaps in the data. Once you've located them, missing values can either be imputed (filled in) or dropped to improve data quality; an imputation example follows, with a sketch of the dropping approach right after it.

import pandas as pd

data = {'Age': [25, 30, None, 28, 22, 40, None, 32, 29, 35],
        'Income': [50000, 60000, None, 55000, None, 80000, 95000, 72000, 65000, 70000]}

df = pd.DataFrame(data)

# EDA: Visualizing Missing Values
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing Value Heatmap')
plt.show()

# EDA: Impute Missing Values (assign back rather than using inplace=True on a column slice)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].mean())

Output:
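
The example above imputes the gaps. When only a few rows are affected, the other option mentioned earlier is simply to drop the incomplete rows. A minimal sketch of that alternative, on a small illustrative DataFrame:

import pandas as pd

data = {'Age': [25, 30, None, 28],
        'Income': [50000, 60000, None, 55000]}
df_raw = pd.DataFrame(data)

# Drop any row that contains at least one missing value
print(df_raw.dropna())

# Or drop rows only when a specific column is missing
print(df_raw.dropna(subset=['Age']))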

Example: EDA for Feature Encoding

Categorical variables often need encoding for model consumption. EDA guides the selection and application of suitable encoding techniques. If you'd like to know more about handling categorical data, then click here.

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
        'Education': ['Graduate', 'Undergraduate', 'Graduate', 'Graduate', 'Postgraduate', 'Undergraduate']}

df = pd.DataFrame(data)

# EDA: Visualizing Categorical Features
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Gender', data=df)
plt.title('Distribution of Gender')
plt.show()

sns.countplot(x='Education', data=df)
plt.title('Distribution of Education')
plt.xticks(rotation=45)
plt.show()

# EDA: Applying One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['Gender', 'Education'])

Output:
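
One-hot encoding treats every category as unordered. For a column like Education, which has a natural ranking, an ordinal mapping can be worth considering instead; the ordering used below is an assumption for illustration, not part of the original example.

import pandas as pd

df = pd.DataFrame({'Education': ['Graduate', 'Undergraduate', 'Graduate',
                                 'Graduate', 'Postgraduate', 'Undergraduate']})

# Assumed ordering: Undergraduate < Graduate < Postgraduate
education_order = {'Undergraduate': 0, 'Graduate': 1, 'Postgraduate': 2}
df['Education_Ordinal'] = df['Education'].map(education_order)
print(df)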

Types of Exploratory Data Analysis

Univariate Analysis

Univariate analysis examines individual features in isolation. It provides insights into the distribution and central tendencies of each variable.

Example - Histogram for Feature Distribution: Let's visualize the distribution of ages in a dataset using a histogram.

import matplotlib.pyplot as plt
import pandas as pd

data = {'Age': [25, 30, 32, 28, 22, 40, 55, 32, 29, 35]}
df = pd.DataFrame(data)

plt.hist(df['Age'], bins=5, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

Output:
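
Alongside the histogram, numeric summaries give a quick read on central tendency and spread. A minimal sketch on the same Age column:

import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 32, 28, 22, 40, 55, 32, 29, 35]})

# Central tendency and spread in one call
print(df['Age'].describe())           # count, mean, std, min, quartiles, max
print("Skewness:", df['Age'].skew())  # > 0 suggests a right-skewed distribution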

Bivariate Analysis

Bivariate analysis explores relationships between pairs of features. It reveals correlations and dependencies.

Example - Scatter Plot for Correlation: Let's examine the correlation between hours studied and exam scores.

import matplotlib.pyplot as plt
import pandas as pd

data = {'Hours_Studied': [2, 4, 6, 8, 10],
        'Exam_Score': [50, 70, 75, 90, 95]}

df = pd.DataFrame(data)
plt.scatter(df['Hours_Studied'], df['Exam_Score'])
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Correlation between Hours Studied and Exam Score')
plt.show()

Output:
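
The scatter plot shows the shape of the relationship; a correlation coefficient quantifies its strength. A quick sketch on the same data:

import pandas as pd

df = pd.DataFrame({'Hours_Studied': [2, 4, 6, 8, 10],
                   'Exam_Score': [50, 70, 75, 90, 95]})

# Pearson correlation: values close to 1 indicate a strong positive linear relationship
print(df['Hours_Studied'].corr(df['Exam_Score']))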

Multivariate Analysis

Multivariate analysis involves studying interactions among multiple features. It uncovers complex relationships and patterns.

Example - Pair Plot: Using a pair plot, we can visualize pairwise relationships across multiple features.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [2, 3, 5, 4, 6],
        'Feature3': [5, 6, 8, 7, 9]}

df = pd.DataFrame(data)
sns.pairplot(df)
plt.show()

Output:
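
A correlation heatmap complements the pair plot by condensing all pairwise relationships into a single view. A minimal sketch on the same three features:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5],
                   'Feature2': [2, 3, 5, 4, 6],
                   'Feature3': [5, 6, 8, 7, 9]})

# Heatmap of the correlation matrix across all features
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()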

Synergy of EDA and Clustering

Clustering is an unsupervised learning technique that groups similar data points. EDA helps identify features that drive clustering patterns.

Example - K-Means Clustering with EDA Insights

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = {'Feature1': [1, 2, 3, 10, 11, 12, 20, 21, 22],
        'Feature2': [2, 3, 5, 15, 16, 17, 25, 26, 27]}

df = pd.DataFrame(data)

# EDA: Visualize Data Distribution
plt.scatter(df['Feature1'], df['Feature2'])
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('Data Distribution')
plt.show()

# EDA-Informed K-Means Clustering: the scatter plot above suggests three natural groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(df[['Feature1', 'Feature2']])

# EDA: Visualize Clusters
plt.scatter(df['Feature1'], df['Feature2'], c=df['Cluster'], cmap='rainbow')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('K-Means Clustering')
plt.show()

Output:
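
The number of clusters above was read directly off the scatter plot. When the grouping is less obvious, a common EDA-style check is the elbow method: fit K-Means for several values of k, plot the inertia, and look for the bend. A sketch on the same toy data:

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

df = pd.DataFrame({'Feature1': [1, 2, 3, 10, 11, 12, 20, 21, 22],
                   'Feature2': [2, 3, 5, 15, 16, 17, 25, 26, 27]})

# Fit K-Means for a range of k and record the inertia (within-cluster sum of squares)
inertias = []
ks = range(1, 7)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(df[['Feature1', 'Feature2']])
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()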

EDA Helps in Identifying Outliers

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors, and the analysis aimed at detecting them is referred to as outlier mining. There are many ways to detect outliers, and removing them from a DataFrame is the same as removing any other row from a pandas DataFrame.

Handling Outliers

Example:

# Importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

# Box plot: points beyond the whiskers are potential outliers
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()

Output:
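
The box plot flags outliers visually. A common numeric alternative is the Z-score rule, which marks points lying more than about three standard deviations from the mean; a minimal sketch, assuming the same Iris.csv file is available:

import pandas as pd
import numpy as np

df = pd.read_csv('Iris.csv')

# Z-score: how many standard deviations each value sits from the column mean
z = (df['SepalWidthCm'] - df['SepalWidthCm'].mean()) / df['SepalWidthCm'].std()

# Flag rows whose absolute Z-score exceeds 3
print(df[np.abs(z) > 3])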

Removing Outliers

To remove an outlier, you follow the same process as removing any other entry from the dataset, using its exact position, because every detection method above ultimately returns a list of the data items that satisfy that method's outlier definition.

# Importing packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

# IQR (interquartile range)
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Positions above the upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))

# Positions below the lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# Removing the outliers by index
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)
plt.show()

Output:

Impact of EDA on Model Performance

Effective EDA directly impacts model performance and generalization. It enhances your ability to:

  • Make informed decisions about data preprocessing and feature engineering.

  • Optimize hyperparameters based on data insights.

  • Address challenges such as class imbalance or multicollinearity (a quick sketch of both checks follows below).
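
Both of those checks are quick to run during EDA. A minimal sketch on a small hypothetical classification DataFrame (the column names are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Feature2': [2, 4, 6, 8, 10, 12, 14, 16],
                   'Target':   [0, 0, 0, 0, 0, 0, 1, 1]})

# Class imbalance: relative frequency of each target class
print(df['Target'].value_counts(normalize=True))

# Multicollinearity: correlation between predictors (Feature1 and Feature2
# are perfectly correlated here, a red flag for linear models)
print(df[['Feature1', 'Feature2']].corr())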

Example of EDA used in problem-solving

Heart Stroke Detection

To solve this problem, several of the EDA techniques covered above are applied, demonstrating their impact and the insights they can provide into a dataset. You can check it out in the links given below, and if you'd like to try it yourself, a link to the dataset is provided too. Have fun!

Source Code (IPYNB file): Click here

Dataset: Click here

Conclusion

Exploratory Data Analysis is your ally in the quest for understanding and harnessing the potential of your data. From handling missing values to encoding categorical features, EDA equips you with insights and techniques that lay the groundwork for robust model building. Its synergy with clustering amplifies its impact, enabling you to unravel hidden patterns and relationships. By embracing EDA, you unlock a world of data-driven insights that drive the success of your machine-learning endeavors. Happy exploring!
