
Exploratory Data Analysis
Introduction to Data Mining
Bagus Priambodo
Purpose
• The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
• By viewing summary statistics, checking for missing values, and understanding the data types of each column, we can gain valuable insights into our dataset.
• Additionally, visualizing the data helps us uncover patterns and trends that
may not be apparent from the raw numbers alone. Data exploration is an
essential step in any data analysis project, as it forms the foundation for
further analysis and modeling.
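The first-look checks above can be sketched with pandas; the column names and values here are purely illustrative:

```python
import pandas as pd

# Toy stand-in for a real dataset; the columns are hypothetical.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 9.5, 11.2],
    "quality": [5, 5, 6, 5, 6],
    "region":  ["A", "B", "A", None, "B"],
})

summary = df.describe()        # summary statistics for numeric columns
missing = df.isnull().sum()    # missing-value count per column
dtypes  = df.dtypes            # data type of each column
```

These three calls cover the summary statistics, missing values, and data types mentioned above; plotting (e.g. `df.hist()`) would cover the visual side.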
Identify Variables
Descriptive Analytics
1) Providing basic information about variables in a
dataset and
2) Highlighting potential relationships between
variables.
Descriptive Statistics
• Descriptive statistics is about describing and summarizing data. It
uses two main approaches:
1.The quantitative approach describes and summarizes data
numerically.
2.The visual approach illustrates data with charts, plots, histograms,
and other graphs.
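A minimal sketch of the quantitative approach using Python's standard `statistics` module (the sample values are made up):

```python
import statistics as stats

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative sample

mean   = stats.mean(data)     # central tendency
median = stats.median(data)
mode   = stats.mode(data)
stdev  = stats.stdev(data)    # dispersion (sample standard deviation)
```

The visual approach would instead pass `data` to a histogram or box plot, e.g. with matplotlib.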
Univariate vs. Bivariate
• When you describe and summarize a single variable, you’re
performing univariate analysis.
• When you search for statistical relationships among a pair of
variables, you’re doing a bivariate analysis.
Univariate
• Involves only one variable
• Doesn't deal with relationships or causes
• The prime purpose of univariate analysis is describing:
  – Dispersion: variance, range, standard deviation, quartiles, maximum, minimum / distribution
  – Central tendency: mean, median, and mode
  – Bar graph, pie chart, histogram, box-and-whisker plot, line graph

Bivariate
• Involves two variables
• Deals with causes or relationships
• The prime purpose of bivariate analysis is explaining:
  – Correlations: comparisons, explanations, causes, relationships
  – Dependent and independent variables
  – Tables where just one variable is dependent on other variables' values
  – Simultaneous analysis of two variables
Univariate
• Distribution (also called frequency distribution)
Understanding the distribution of your data can help you choose the right statistical test, identify outliers, check for normality, and visualize the data. By understanding the distribution of your data, you can ensure that your results are accurate, reliable, and valid.
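A frequency distribution is simply a count of how often each value occurs; a minimal sketch with invented values:

```python
from collections import Counter

values = [3, 5, 5, 6, 6, 6, 7, 7, 9]   # illustrative sample
freq = Counter(values)                  # frequency distribution

# The most frequent value in this sample, with its count.
most_common_value, count = freq.most_common(1)[0]
```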
Skewness
Skewed data describes an uneven distribution in which the values are not spread out evenly: the majority of the data falls on either the left or the right side of the distribution.
Causes of skewed data

Skewed data can be caused by various factors, including:


• Outliers: Outliers are extreme values that differ from most data points. When outliers exist in our data, they can skew the distribution.
• Measurement errors: Inaccuracies or errors in the measurement process can introduce skewness to our data.
• Natural variability: In some cases, real-world data naturally has more data points on one side than the other, resulting in skewness.
• Transformation or data manipulation: Skewness can also be introduced
when we perform certain data transformations or manipulations.
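Skewness can also be quantified; here is a sketch of the (population) Fisher-Pearson coefficient, with made-up samples:

```python
import statistics as stats

def skewness(xs):
    """Fisher-Pearson coefficient of skewness (population form):
    the mean of cubed standardized deviations."""
    n = len(xs)
    mu = stats.mean(xs)
    sigma = stats.pstdev(xs)
    return sum(((x - mu) / sigma) ** 3 for x in xs) / n

right_skewed = [1, 2, 2, 3, 3, 3, 20]   # long right tail -> positive skew
symmetric    = [1, 2, 3, 4, 5]          # symmetric -> skew of zero
```

In practice `pandas.Series.skew()` or `scipy.stats.skew()` compute (sample-adjusted) variants of the same quantity.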
Outliers
Outliers: Drop Them or Not?
• DROP: If it is obvious that the outlier is due to incorrectly entered or
measured data, you should drop the outlier.
Example: a year field containing ‘9999’ or ‘0000’
• DROP: If the outlier does not change the results but does affect
assumptions, you may drop the outlier.
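For the year example above, a filter over an assumed plausible range is enough (the record values and the range itself are hypothetical):

```python
# Hypothetical records; 9999 and 0 are sentinel/typo values for `year`.
records = [
    {"year": 1998}, {"year": 2005}, {"year": 9999},
    {"year": 2010}, {"year": 0},
]

VALID_YEARS = range(1900, 2100)   # assumed plausible range for this field
clean = [r for r in records if r["year"] in VALID_YEARS]
```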
Outliers: Drop Them or Not?
• DROP AND EXPLAIN WHY: More commonly, the outlier affects both results and assumptions. In this situation, it is not legitimate to simply drop the outlier without comment; if you remove it, report the removal and explain why.
Outliers: Drop Them or Not?
• DROP: If the outlier creates a significant association, you should drop it.
In the following graph, the relationship between X and Y is clearly created by the outlier. Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.
What to do when we shouldn’t drop the
outlier?
Try a different model: this should be done with caution, but it may be that a non-linear model fits better.
Example: neural networks, KNN
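As a sketch of the idea, here is a tiny 1-D k-nearest-neighbours regressor in plain Python (in practice one would use a library such as scikit-learn's `KNeighborsRegressor`):

```python
def knn_predict(X, y, query, k=3):
    """Predict the target at `query` as the mean target of the k
    training points nearest to it (1-D KNN regression)."""
    ranked = sorted(range(len(X)), key=lambda i: abs(X[i] - query))
    return sum(y[i] for i in ranked[:k]) / k

# Non-linear data (y = x^2) that a straight line would fit poorly.
X = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]
```

Unlike a linear regression, the prediction adapts locally to the curve, so a single extreme point influences only queries near it.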
Example Outliers
• https://archive.ics.uci.edu/ml/datasets/wine+quality
Feature Selection
• Due to their linear dependence, two highly correlated variables can have nearly the same ability to predict the outcome value for an observation (Vishal, 2018). Removing one of the correlated variables before training the model benefits the learning process and can result in performance similar to that of the full model.
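A common sketch of this idea: compute the absolute correlation matrix and drop the later member of any pair above a threshold (the frame and threshold here are illustrative):

```python
import pandas as pd

# Hypothetical frame where `b` is nearly a copy of `a`.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1.1, 2.0, 3.1, 4.0, 5.1],   # highly correlated with `a`
    "c": [5, 3, 8, 1, 9],             # unrelated
})

corr = df.corr().abs()
THRESHOLD = 0.95

# For each column, check only earlier columns, so one of each pair survives.
to_drop = [col for i, col in enumerate(corr.columns)
           if any(corr.iloc[j, i] > THRESHOLD for j in range(i))]
reduced = df.drop(columns=to_drop)
```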
Case Study
• https://archive.ics.uci.edu/ml/datasets/wine+quality
Wine dataset

• Here, as you can notice, the mean value is less than the median value of each column, which is represented by 50% (the 50th percentile) in the index column.
• There is a notably large difference between the 75th percentile and the max values of the predictors “residual sugar”, “free sulfur dioxide”, and “total sulfur dioxide”.
• Thus, observations 1 and 2 suggest that there are extreme values (outliers) in our data set.
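The 75th-percentile-vs-max check above can be reproduced on any column; here with an invented series mimicking “residual sugar”:

```python
import pandas as pd

# Illustrative values with one extreme entry in the right tail.
s = pd.Series([1.9, 2.0, 2.1, 2.2, 2.6, 3.0, 45.0])

p75 = s.quantile(0.75)   # 75th percentile
gap = s.max() - p75      # a large gap flags extreme values (outliers)
```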
Wine dataset
Example Feature Selection
Dataset
Relationship between dependent and independent variables
Result of Prediction
Normal Distribution and Linear Regression
• Normality and linearity are related but distinct concepts in statistics. Normality
refers to the distribution of a variable being normally distributed, meaning that the
data is symmetrical around the mean and follows a bell-shaped curve. Linearity
refers to a linear relationship between two variables, meaning that a change in one
variable is associated with a proportional change in the other variable.
• In many statistical analyses, the normality and linearity assumptions must be met to ensure accurate and meaningful results. However, they are not interchangeable concepts: a variable can be normally distributed without having a linear relationship with another variable, and vice versa.
• A normal distribution can be used to assess the linear relationship between two
variables through the use of correlation analysis, even if the variables themselves
are not normally distributed. However, it is important to note that non-normality of
the data can affect the reliability of the results, and non-parametric correlation
tests may be more appropriate in such cases. Additionally, it is important to check
the assumptions of any statistical method used and to use appropriate techniques
for analyzing data that violate normality and linearity assumptions.
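The last point can be illustrated: for a monotone but non-linear pair, a rank-based (non-parametric) correlation captures the relationship better than Pearson's (toy data):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([1, 4, 9, 16, 25])   # y = x^2: monotone, but not linear

pearson  = x.corr(y)                      # assumes linearity
spearman = x.corr(y, method="spearman")   # rank-based: exactly 1 here
```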
