
Exploratory Data Analysis

If you are a data scientist or a machine learning enthusiast, you probably know that EDA stands for
Exploratory Data Analysis.
But do you know why EDA is so important in the ML workflow?
In this blog post, I will try to explain why EDA is not just a preparatory step, but a critical one that can
make or break your ML project.

What EDA is
EDA is the process of exploring and understanding your data before applying any ML algorithms or
models.
It involves visualizing, summarizing, and finding patterns, outliers, and anomalies in your data.
EDA helps you to gain insights and intuition about your data, which can guide your ML choices and
improve your results.

Why EDA is performed in ML:


1. To identify and fix data quality issues:
This includes missing values, incorrect labels, duplicates, or errors.
These issues can affect the performance and accuracy of your ML models, so it is better to deal with
them early on.
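As a sketch of this step, here is how such issues might be surfaced and fixed with pandas, using a small hypothetical dataset (column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical toy dataset with common quality issues
df = pd.DataFrame({
    "age": [25, 25, None, 40],
    "label": ["yes", "yes", "no", "Yes"],  # inconsistent label casing
})

print(df.isna().sum())        # count missing values per column
print(df.duplicated().sum())  # count exact duplicate rows

df = df.drop_duplicates()              # remove duplicate rows
df["label"] = df["label"].str.lower()  # normalize label casing
```

The same pattern scales to real datasets: count the problems first, then decide how to fix each one.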

2. To understand the data:


EDA helps you become familiar with the distribution, range, and variability of your data.

3. To choose the appropriate ML techniques:


Understanding the data through EDA helps you choose the right preprocessing techniques, such as scaling, normalization, transformation, or feature engineering, that can enhance your data and make it more suitable for machine learning.

4. To select the most relevant features:


EDA helps you to discover the relationships and correlations between your variables. This can help
you to select the most relevant and informative features for your ML models, and avoid
multicollinearity or redundancy.
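A correlation matrix makes near-redundant feature pairs easy to spot. A minimal sketch with pandas, using synthetic data in which one column is deliberately almost collinear with another:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_redundant": 2 * x + rng.normal(scale=0.01, size=200),  # nearly collinear with x
    "y": rng.normal(size=200),
})

corr = df.corr()
# Flag off-diagonal pairs with very high correlation as multicollinearity candidates
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(corr.round(2))
```

When `high` flags a pair, one of the two features can usually be dropped without losing information.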

5. To generate or engineer new features:


EDA provides inspiration and reveals avenues for creating new features, e.g., by combining or transforming existing ones.
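As one illustration (the columns here are hypothetical), two existing measurements can be combined into a single, more informative feature:

```python
import pandas as pd

# Hypothetical dataset with two raw measurements
df = pd.DataFrame({"height_m": [1.6, 1.8], "weight_kg": [60.0, 81.0]})

# Derive a new feature by combining existing ones (BMI = weight / height^2)
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```

The ratio captures a relationship between the two columns that neither expresses on its own.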

6. To detect and handle outliers and anomalies:


Outliers are extreme values that deviate from the normal range of your data, while anomalies are
values that do not conform to the expected pattern or behaviour of your data.
Both outliers and anomalies can affect the performance and generalization of your ML models, so it is
important to identify them and decide how to deal with them (e.g., remove them, replace them, or
keep them).
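A common convention for flagging extreme values is the 1.5×IQR rule; this is one heuristic among several, sketched here with pandas on a toy series:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an extreme value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # decide whether to remove, replace, or keep these
```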

7. To test your assumptions and hypotheses about your data:


Note that EDA isn’t enough to draw definitive conclusions, but it helps in testing your intuitive assumptions and hypotheses about your data.
For example, you might have some prior knowledge or expectations about what your data should look like, or how your variables should interact with each other. EDA can help you validate or invalidate these assumptions and hypotheses, and adjust them accordingly.
8. To communicate and present your findings and insights to others:
Your insights are only valuable if others understand them.

EDA often involves creating visualizations, such as charts, graphs, plots, or maps, that can help you to
convey complex information in a simple and intuitive way.
Visualizations can also help you to tell a story with your data, and highlight the key points and
takeaways for your audience.

In conclusion
Building your ML models and selecting features based on intuition alone, without carefully carrying out EDA, is bad practice and will undermine your model's performance.
Exploratory Data Analysis in ML

What is exploratory data analysis?


Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. It
helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis
testing task, and provides a better understanding of data set variables and the
relationships between them. It can also help determine if the statistical techniques you are
considering for data analysis are appropriate. Originally developed by American
mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used
method in the data discovery process today.

Why is exploratory data analysis important in data science?


The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, better understand patterns within the data, detect outliers
or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. EDA also helps stakeholders by
confirming they are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals. Once EDA is complete
and insights are drawn, its features can then be used for more sophisticated data analysis or
modeling, including machine learning.

Programming Language Used


Python: an interpreted, object-oriented programming language with dynamic semantics. Its
high-level, built-in data structures, combined with dynamic typing and dynamic binding,
make it very attractive for rapid application development, as well as for use as a scripting or
glue language to connect existing components together. Python and EDA can be used
together to identify missing values in a data set, which is important so you can decide how to
handle missing values for machine learning.
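As a minimal illustration of this use of Python for EDA, assuming a toy DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in both a numeric and a categorical column
df = pd.DataFrame({"income": [50000, np.nan, 62000], "city": ["A", "B", None]})

missing_per_column = df.isnull().sum()
print(missing_per_column)  # one missing value in each column
```

Knowing where the gaps are is the prerequisite for deciding how to handle them before modeling.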
Types of Exploratory Data Analysis:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the most basic type of data analysis, since it uses only
one variable to examine the data. The basic objective of univariate non-graphical EDA is to
understand the sample distribution of the underlying data in order to draw conclusions
about the population. The analysis also includes outlier detection. The population
distribution’s characteristics include:
• Central tendency: The central tendency, or location, of a distribution has to do with its
average or middle values. The mean, median, and occasionally the mode are frequently
useful measures of central tendency, with the mean being the most common. The median
may be preferred when the distribution is skewed or when outliers are a concern.
• Spread: Spread is a gauge of how far from the center we should look to find the data
values. The variance and standard deviation are two helpful measures of spread. The
variance is the mean of the squared deviations from the mean, and the standard deviation
is its square root.
• Skewness and kurtosis: Two more useful univariate descriptors are the skewness and
kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a more
subtle measure of peakedness relative to a normal distribution.
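These univariate summaries can all be computed directly with pandas; a sketch on a small, deliberately right-skewed sample:

```python
import pandas as pd

s = pd.Series([2, 3, 3, 4, 5, 6, 18])  # right-skewed toy sample

summary = {
    "mean": s.mean(),          # central tendency, pulled up by the outlier
    "median": s.median(),      # robust central tendency
    "std": s.std(),            # spread: square root of the variance
    "variance": s.var(),
    "skewness": s.skew(),      # positive for a right-skewed sample
    "kurtosis": s.kurt(),      # peakedness relative to a normal distribution
}
print(summary)
```

Note how the mean exceeds the median here, which is exactly the signature of right skew described above.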
2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques typically use
cross-tabulation or statistics to show the relationship between two or more variables.
• For categorical data, an extension of tabulation called cross-tabulation is extremely
useful. For two variables, cross-tabulation builds a two-way table with column headings
that match the levels of one variable and row headings that match the levels of the other
variable, then fills in the count of all subjects that share each pair of levels.
• For one categorical variable and one quantitative variable, we compute statistics for the
quantitative variable separately for each level of the categorical variable, then compare
the statistics across those levels.
• Comparing means is an informal version of one-way ANOVA, and comparing medians is a
robust alternative to it.
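Both techniques map directly onto pandas; a sketch with hypothetical customer data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "churned": ["yes", "no", "no", "no", "yes"],
    "spend":   [120.0, 80.0, 60.0, 70.0, 50.0],
})

# Cross-tabulation of two categorical variables: counts per pair of levels
table = pd.crosstab(df["segment"], df["churned"])

# Per-level statistics of a quantitative variable (informal ANOVA-style comparison)
group_means = df.groupby("segment")["spend"].mean()
print(table)
print(group_means)
```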
3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do
not give a complete picture of the data; therefore graphical methods, which involve a
degree of subjective analysis, are also required. Common types of univariate graphics are:
• Histogram: The most basic graph is the histogram, a barplot in which each bar represents
the frequency (count) or proportion (count/total count) of cases for a range of values.
Histograms are one of the simplest ways to quickly learn a lot about your data, including
central tendency, spread, modality, shape, and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values as well as the shape of the distribution.
• Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots are
excellent at presenting information about central tendency and show robust measures of
location and spread, as well as providing information about symmetry and outliers,
although they can be misleading about aspects like multimodality. One of the best uses of
boxplots is in the form of side-by-side boxplots.
• Quantile-normal plots: The final univariate graphical EDA technique is the most intricate.
It is called the quantile-normal or QN plot, or more generally the quantile-quantile or QQ
plot. It is used to see how well a particular sample follows a particular theoretical
distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis.
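A sketch of three of these plots on synthetic data, assuming Matplotlib and SciPy are installed (the off-screen "Agg" backend is used so the script runs without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)  # synthetic sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(data, bins=30)                      # histogram: shape, spread, outliers
axes[1].boxplot(data)                            # boxplot: center, spread, outliers
stats.probplot(data, dist="norm", plot=axes[2])  # QQ plot against a normal distribution
fig.savefig("univariate_eda.png")
```

For a truly normal sample like this one, the QQ plot's points fall close to the reference line; skewness or heavy tails would show up as systematic curvature.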
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The only one commonly used is the grouped barplot,
with each group representing one level of one variable and each bar within a group
representing a level of the other variable.
Other common types of multivariate graphics are:
• Scatterplot: For two quantitative variables, the essential graphical EDA technique is the
scatterplot, which has one variable on the x-axis and one on the y-axis, with a point for
each case in your dataset.
• Run chart: A line graph of data plotted over time.
• Heat map: A graphical representation of data where values are depicted by color.
• Multivariate chart: A graphical representation of the relationships between factors and a
response.
• Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-
dimensional plot.
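A minimal sketch of a scatterplot and a correlation heat map with Matplotlib, on synthetic data where y depends linearly on x:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # y linearly related to x, plus noise

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y)  # scatterplot of two quantitative variables
ax1.set(xlabel="x", ylabel="y")

corr = np.corrcoef(np.vstack([x, y]))  # 2x2 correlation matrix
im = ax2.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)  # simple heat map
fig.colorbar(im, ax=ax2)
fig.savefig("multivariate_eda.png")
```

The upward-sloping point cloud and the warm off-diagonal cells of the heat map convey the same relationship in two visual forms.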
Exploratory Data Analysis (EDA) and Customer Segmentation of Credit Score
Classification Dataset
Introduction
Exploratory Data Analysis (EDA) is a crucial phase in any data science project, enabling data
scientists to gain insights, identify patterns, and prepare data for further analysis. This article
serves as a comprehensive guide to conducting EDA effectively. We will break down the
process into key steps and provide detailed insights at each stage.

1. Project Overview 📝
Project Title: Create a concise, descriptive title that encapsulates the main theme of your
analysis.
Goal of the project: Break down the project goals into specific objectives or research
questions. For instance, if your goal is to understand customer churn, you might have sub-
goals like “Identify key factors affecting churn rates” or “Segment customers based on
churn behavior.”
Dataset(s) used: List the dataset names, sources, formats, and any data preprocessing steps
you applied (e.g., cleaning, merging datasets).
Team Members: Specify the roles and responsibilities of each team member. Who was
responsible for data cleaning, analysis, visualization, and reporting?

2. Data Overview 📁
Source(s) of the Data: Provide detailed information about the data sources, including URLs,
databases, and any data retrieval methods.

Data Size: Include information on the number of records, the number of features (columns),
and the memory usage (e.g., “10,000 records with 20 features, consuming 5 MB of
memory”).

Brief Description of the Data: Elaborate on the context and significance of the data, including
how it relates to the project’s goals. Mention any data collection issues or peculiarities.

3. Data Cleaning and Preparation 🔧


Missing Value Treatment: Provide a step-by-step account of how you dealt with missing
data, which may involve data imputation methods such as mean, median, or advanced
techniques like regression imputation.
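The two simple strategies can be sketched with pandas on a hypothetical column (regression imputation, which models the missing column from the other features, is not shown here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 30.0, 29.0]})

# Mean and median imputation fill the gap with a summary of the observed values
df["age_mean_imputed"] = df["age"].fillna(df["age"].mean())
df["age_median_imputed"] = df["age"].fillna(df["age"].median())
```

Note that the two strategies can produce different fills; documenting which one was used, and why, is part of this step.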

Outlier Detection & Treatment: Detail the process of identifying outliers and your approach
to handling them, such as using visualization, statistical tests, or filtering.

Feature Engineering: Elaborate on feature engineering, including the creation of new
features, transformations, and scaling. Explain the rationale behind these engineering
decisions.
4. Exploratory Data Analysis
Univariate Analysis: Include a breakdown of the univariate analysis. Provide summary
statistics for each feature, histograms, kernel density plots, and discuss the distribution of
variables. Mention any data skewness.
Bivariate Analysis: Extend the analysis by explaining the bivariate exploration, including
scatter plots, correlation matrices, and any significant relationships discovered.

Multivariate Analysis: Describe how you conducted multivariate exploration, such as
clustering or principal component analysis, to uncover complex interactions between
variables.
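One way to sketch principal component analysis without extra dependencies is via NumPy's SVD, here on synthetic data with a mostly redundant third feature:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # third feature mostly redundant

# PCA via SVD: center the data, decompose, project onto the top components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)  # variance explained by each component
X_2d = Xc @ Vt[:2].T                   # project onto the first two principal components
```

Because one feature is nearly a copy of another, most of the variance concentrates in the first two components, and the 2D projection loses little information.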

5. Visualization 📊
Graphs and Plots: Provide a detailed inventory of the types of graphs and plots used for
visualization, with explanations of how each visualization was chosen for specific insights.
Insights Derived from the Visuals: Elaborate on the insights gained from the visualizations,
including patterns, anomalies, and trends. Relate these findings to the project’s goals.

6. Hypothesis Testing and Insights 🎯


Statistical Tests Used: Describe the specific statistical tests performed, including the null and
alternative hypotheses, test statistics, and degrees of freedom.
Findings and Insights: Explain the results of hypothesis tests and their implications. Discuss
any significant relationships or differences and their relevance to the project’s goals.
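As a sketch of this step, a two-sample t-test with SciPy on hypothetical groups whose true means actually differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=100, scale=10, size=80)  # hypothetical control group
group_b = rng.normal(loc=110, scale=10, size=80)  # hypothetical treatment group

# Two-sample t-test: H0 = equal means, H1 = different means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
```

A small p-value here leads to rejecting the null hypothesis of equal means; the report should state the hypotheses, the test statistic, and this conclusion explicitly.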

7. Final Report and Presentation 📑


Key Findings: Summarize the most critical findings in a concise and impactful manner. Use
visuals and key statistics to highlight these findings.
Limitations of the Analysis: Discuss potential limitations in the data, methodology, or
analysis. Address any sources of bias or uncertainty.
Recommendations and Next Steps: Offer actionable recommendations based on your
insights and propose what actions or analyses should follow this EDA.

8. References and Data Sources 📚


Dataset Links: Provide direct links to the sources of the data. Verify that the links are
accessible and up-to-date.
Research Papers or Articles Referenced: List external sources, research papers, or articles
you referenced during the analysis, with proper citations.

9. Project Files 📂
EDA Code Files: List and organize the code files, scripts, or notebooks used for different
stages of the analysis. Include comments and explanations in the code.
Data Files: Specify the data files used, including their names, formats, and descriptions.
Presentations or Reports: Include the final presentations, reports, or documents created for
communication and sharing of your EDA project results.

Conclusion
Exploratory Data Analysis is a fundamental step in the data science workflow. This guide
provides a comprehensive framework to help you navigate the process effectively, from
project initiation to data exploration, hypothesis testing, and beyond. By following these
steps, you’ll be better equipped to uncover valuable insights and make informed decisions
based on your data.
