Exploratory Data Bagus Priambodo
Data
Introduction to Data Mining
Bagus Priambodo
Purpose
• The main purpose of exploratory data analysis (EDA) is to help us look at
data before making any assumptions. It can help identify obvious errors,
better understand patterns within the data, detect outliers or anomalous
events, and find interesting relationships among the variables.
• By viewing summary statistics, checking for missing values, and
understanding the data type of each column, we can gain valuable insights
into our dataset.
• Additionally, visualizing the data helps us uncover patterns and trends that
may not be apparent from the raw numbers alone. Data exploration is an
essential step in any data analysis project, as it forms the foundation for
further analysis and modeling.
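The checks mentioned above (summary statistics, missing values, and column data types) can be sketched in pandas roughly as follows. This is a minimal sketch on a made-up DataFrame: the column names and values are assumptions chosen for illustration, not the dataset used in the course.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 10.5, 11.2],
    "quality": [5, 5, 6, 7, 6],
    "color": ["red", "red", "white", "white", "red"],
})

summary = df.describe()    # count, mean, std, min, quartiles, max (numeric columns)
missing = df.isna().sum()  # number of missing values per column
dtypes = df.dtypes         # data type of each column

print(summary)
print(missing)
print(dtypes)
```

By default `describe()` summarizes only the numeric columns, so the text column is left out of the summary, and the missing `alcohol` value shows up as a count of 1 in `missing`.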
Identify Variables
Descriptive Analytics
Descriptive analytics serves two purposes:
1) Providing basic information about the variables in a dataset, and
2) Highlighting potential relationships between variables.
Descriptive Statistics
• Descriptive statistics is about describing and summarizing data. It
uses two main approaches:
1. The quantitative approach describes and summarizes data numerically.
2. The visual approach illustrates data with charts, plots, histograms,
and other graphs.
Univariate vs. Bivariate
• When you describe and summarize a single variable, you’re
performing univariate analysis.
• When you search for statistical relationships among a pair of
variables, you’re doing a bivariate analysis.
Univariate
• Involves only one variable
• Doesn't deal with relationships or causes
• The prime purpose of univariate analysis is describing:
   • Central tendency: mean, median, and mode
   • Dispersion: variance, range, standard deviation, quartiles, maximum, minimum
   • Distribution
   • Bar graph, pie chart, histogram, box-and-whisker plot, line graph
Bivariate
• Involves two variables
• Deals with causes or relationships
• The prime purpose of bivariate analysis is explaining:
   • Correlations: comparisons, explanations, causes, relationships
   • Dependent and independent variables
   • Tables where just one variable is dependent on other variables' values
   • Simultaneous analysis of two variables
Univariate
• Distribution (Also Called Frequency Distribution)
• In the summary table, the mean value of each column is less than its
median, which appears as the 50% (50th percentile) row of the index.
• There is a notably large difference between the 75th percentile and the
maximum values of the predictors "residual sugar", "free sulfur dioxide",
and "total sulfur dioxide".
• Observations 1 and 2 therefore suggest that there are extreme values
(outliers) in our data set.
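The comparison described above can be sketched with `DataFrame.describe()`. The values below are made up to stand in for a single column and are not taken from the wine dataset. The 1.5 × IQR fence is a common rule of thumb (Tukey's rule) added here as an assumption; the slides do not prescribe it.

```python
import pandas as pd

# Hypothetical values standing in for one column of the wine data.
df = pd.DataFrame({
    "residual sugar": [1.2, 1.6, 1.9, 2.0, 2.2, 2.4, 2.6, 3.0, 15.5],
})

stats = df.describe()
q1 = stats.loc["25%", "residual sugar"]
median = stats.loc["50%", "residual sugar"]   # the "50%" row is the median
q3 = stats.loc["75%", "residual sugar"]
col_max = stats.loc["max", "residual sugar"]

# A maximum far above the 75th percentile hints at extreme values.
# Tukey's rule (an assumption here): flag values above Q3 + 1.5 * IQR.
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = df[df["residual sugar"] > upper_fence]

print(stats)
print(outliers)
```

In this toy column the gap between the 75th percentile and the maximum is large, and the fence flags the single extreme value.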
Wine dataset
Example Feature Selection
Dataset
Relationship between dependent and
independent variables
Result of Prediction
Normal Distribution and Linear Regression
• Normality and linearity are related but distinct concepts in statistics. Normality
means that a variable follows a normal distribution: the data are symmetric
around the mean and follow a bell-shaped curve. Linearity refers to a linear
relationship between two variables, meaning that a change in one variable is
associated with a proportional change in the other variable.
• Many statistical analyses require the normality and linearity assumptions to be
met in order to give accurate and meaningful results. However, the two concepts
are not interchangeable: a variable can be normally distributed without having a
linear relationship with another variable, and vice versa.
• Correlation analysis can be used to assess the linear relationship between two
variables even if the variables themselves are not normally distributed. However,
non-normality of the data can affect the reliability of the results, and
non-parametric correlation tests (for example, Spearman's rank correlation) may
be more appropriate in such cases. It is also important to check the assumptions
of any statistical method used and to choose techniques suited to data that
violate the normality or linearity assumptions.