
PROBLEM SOLVING IN BUSINESS MANAGEMENT
Session 4
Exploratory Data Analysis
M.Sc. Thien Nguyen
Email: thien.nguyen@isb.edu.vn
Phone: 0949088908
Agenda
1. Introduction to EDA
2. Common Analyses
Part I: Introduction To Exploratory Data Analysis
1. What is EDA?
2. The Typical Cycle of EDA

I. Introduction
Review: What is Data Analytics?

Source: Google Data Analytics Course (https://www.coursera.org)



Four Types (Levels) Of Data Analytics

Descriptive Analytics
- Is a simple, surface-level type of analysis based on historical data to examine, understand, and describe what happened
- Uses BI and visualization tools to summarize the data, or discover trends and patterns
- E.g.: Has the number of customers gone up? Are sales better this month than last?

Diagnostic Analytics
- Tries to uncover causal relationships
- May involve seeking to identify anomalies within the data
- E.g.: Did the latest marketing campaign impact sales?

Predictive Analytics
- Is based on historical data, past trends, and assumptions to predict future outcomes
- Uses machine learning models

Prescriptive Analytics
- Tries to find out and suggest what individuals or organizations should do to obtain future targets/goals
- Uses predictive analytics to show results of different scenarios

Others: cognitive analytics, behavioral analytics, risk analytics...


Typical Steps in a Data Analytics Project

Source: https://en.wikipedia.org/wiki/Data_analysis
I. Introduction
What is EDA?

➤EDA is the method of studying and exploring a dataset to deeply understand it


➤EDA can be done when we prepare data

The data can be:

➤Raw (not processed yet)
➤Not yet cleaned
➤Incomplete (containing missing data)
➤Redundant or duplicated

Source: Exploratory Data Analysis - An Important Step in Data Science



➤EDA can be considered as a set of statistical techniques for:


● Exploring
● Describing
● Summarising the nature of the data

Source: A Practical Introductory Guide to Exploratory Data Analysis


“The main purpose of EDA is to help look at the data before making any assumptions”

Source: What is exploratory data analysis? | IBM


Typical Things To Do in EDA:

➤Screen the data to understand each data field (column)


➤Identify possible errors
➤Reveal the presence of outliers
➤Check the relationships between variables (correlation, causality)
➤Perform descriptive analysis (univariate, bivariate, and multivariate) to summarize the most significant aspects
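
The first screening steps (understanding each field, spotting missing values and duplicates) can be sketched in plain Python. This is a minimal sketch under stated assumptions: records are dictionaries, `None` marks a missing value, and both the `screen` helper and the sample `rows` are hypothetical illustrations rather than course material.

```python
from collections import Counter

# Hypothetical records for illustration; None marks a missing value.
rows = [
    {"branch": "Da Nang", "gender": "Female", "amount": 120.5},
    {"branch": "Ha Noi", "gender": "Male", "amount": None},
    {"branch": "Ha Noi", "gender": None, "amount": 88.0},
    {"branch": "Da Nang", "gender": "Female", "amount": 120.5},  # exact copy of the first row
]


def screen(records):
    """Return observed types per column, missing counts per column, and duplicate row count."""
    columns = list(records[0].keys())
    # Type names seen in each column, ignoring missing values.
    types = {
        c: sorted({type(r[c]).__name__ for r in records if r[c] is not None})
        for c in columns
    }
    # Number of missing values in each column.
    missing = {c: sum(1 for r in records if r[c] is None) for c in columns}
    # A row is a duplicate when an identical row has already been seen.
    row_counts = Counter(tuple(r[c] for c in columns) for r in records)
    duplicates = sum(n - 1 for n in row_counts.values())
    return types, missing, duplicates


types, missing, duplicates = screen(rows)
print("types:", types)
print("missing:", missing)
print("duplicate rows:", duplicates)
```

A column that reports more than one type, or a high missing count, is usually the first thing to question before any further analysis.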

I. Introduction to EDA
Typical Cycle

➤EDA is an iterative process, in which we start by asking questions

1. Generating questions (or hypotheses) about the data
2. Finding answers by analyzing, modeling and visualizing
3. Using the answers to refine the questions and generate new questions

Source: https://r4ds.had.co.nz/exploratory-data-analysis.html
Summary: EDA

With Data Fields (columns)
● Name, meaning, relationship?
● Data type?
● No. of missing values?
● Common errors? Duplicates?

Making Questions & Answering
● Qualitative: What categories? Frequency of each category? Descriptive measures of each category
● Quantitative:
  - Univariate analysis: Descriptive measures; Outliers? Abnormal values?
  - Multivariate analysis: Covariance? Correlation?
● Qualitative & Quantitative Analysis: Inferential statistics: Regression? Clustering?

Source: https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/
Note: EDA does not follow a fixed set of rules. It depends on you!

Part III: Common Analyses
1. Univariate Analysis
2. Detecting Outliers
3. Multivariate Analysis
4. Regression Analysis

III Common Analyses
1. Univariate Analysis

➤Quantitative analysis: Univariate vs. Bivariate vs. Multivariate


➤Fundamental Measures in Univariate Analysis

● Measures of frequency: Number of Occurrences, Percentage
● Measures of central tendency: Mean, Median, Mode
● Measures of spread (dispersion/variability): Range, Variance & Standard Deviation, Standard Error
● Measures of position: Percentiles & Quantiles, Quartiles, Standard Scores
● Measures of shape: Skewness / Kurtosis, Normal Distribution
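
Several of these measures can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the `values` list is a hypothetical sample, and it uses the sample (n - 1) variance and the module's default quartile method, both of which are choices rather than the only valid definitions.

```python
import statistics

# Hypothetical sample of a quantitative variable.
values = [12, 15, 15, 18, 21, 24, 30]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Spread (dispersion)
value_range = max(values) - min(values)
variance = statistics.variance(values)      # sample variance, divides by n - 1
std_dev = statistics.stdev(values)          # sample standard deviation
std_error = std_dev / len(values) ** 0.5    # standard error of the mean

# Position: the three cut points that split the data into quartiles
q1, q2, q3 = statistics.quantiles(values, n=4)

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std:", std_dev, "se:", std_error)
print("quartiles:", q1, q2, q3)
```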

Univariate Analysis: Most important measures

Variable types:
● Qualitative (categorical & discrete) variables
● Quantitative (discrete & continuous) variables

Measures:
● Measures of Frequency: Charts / Graphs, Counts, Percentages
● Measures of Central Tendency: Mean, Median, Mode
● Measures of Spread/Dispersion: Max / Min / Range, Variance, Standard Deviation, Standard Error
● Measures of Position: Quartile, Quantile, Ranking
● Measures of Shape: Skewness, Kurtosis

III Common Analyses
2. Detecting outliers

➤It depends on domain knowledge
➤A simple way is to use a boxplot

In statistics, a point is commonly considered an outlier if the absolute value of its Z-score is greater than 3.0 (that is, it lies more than three standard deviations from the mean)
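
Both rules can be sketched in a few lines of Python. The example assumes a hypothetical `values` sample and uses the population standard deviation for the Z-score. The boxplot rule shown is the common 1.5 × IQR fence, which is an assumption about how the boxplot is drawn, not something stated on the slide.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the points whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (a choice)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return the points outside the usual boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: twenty ordinary values and one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [200]

print("Z-score rule:", zscore_outliers(values))
print("Boxplot (IQR) rule:", iqr_outliers(values))
```

With very small samples the Z-score rule can never fire, because a single extreme point also inflates the standard deviation. This is one reason the boxplot rule is often the simpler first check.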

Detecting Outliers

Source: https://r-graph-gallery.com/boxplot.html
III.3 Basic Multivariate Analyses

➤ Bivariate Analysis
➤ Multivariate Analysis

Bi-/Multi-variate Analysis

● Qualitative (categorical & discrete) variables: Frequency Table, Cross-tabulation (or Contingency Table of counts)
● Qualitative combined with quantitative variables: Contingency Tables (Pivot Table of SUM, MEAN, MEDIAN…)
● Quantitative (discrete & continuous) variables: Covariance, Correlation, Regression
III.3 Basic Multivariate Analyses
(1) Contingency Table

Contingency Table (VN terms: two-way table, contingency table, correlation table)

➤Used to summarize and analyse relationships between two categorical variables
➤Cross-tabulation (crosstab): a simple way of summarizing frequency (COUNT, PERCENTAGE)
⇒ NOTE: Only for Qualitative (Categorical) columns

Branch \ Gender    Female    Male    Sub-Total
Da Nang               117      93          210
Ha Noi                143     130          273
HCM City              277     240          517
Total                 537     463         1000

Total number of sales in each city
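
A cross-tabulation of counts can be built with `collections.Counter`. This is a minimal sketch: the `sales` list is a small hypothetical sample (not the 1000 records behind the table above), and `crosstab` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical (branch, gender) pairs, one per sale.
sales = [
    ("Da Nang", "Female"), ("Da Nang", "Male"),
    ("Ha Noi", "Female"), ("Ha Noi", "Female"),
    ("HCM City", "Male"), ("HCM City", "Female"), ("HCM City", "Male"),
]


def crosstab(pairs):
    """Count each (row, column) combination plus the row and column sub-totals."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    return cells, row_totals, col_totals


cells, row_totals, col_totals = crosstab(sales)

# Print the table: one line per branch, then the totals line.
for branch in sorted(row_totals):
    print(branch, cells[(branch, "Female")], cells[(branch, "Male")], row_totals[branch])
print("Total", col_totals["Female"], col_totals["Male"], len(sales))
```

A missing combination simply counts as 0, because `Counter` returns 0 for keys it has never seen.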



➤If we combine qualitative variables with a quantitative variable, each cell holds an aggregate of the quantitative variable
● In Excel: PivotTable
● Measures: SUM, AVERAGE, MEDIAN, STD…

Branch \ Gender        Female          Male     Sub-Total
Da Nang             39,155.36     27,994.39     67,149.75
Ha Noi              47,664.66     37,759.64     85,424.30
HCM City            94,923.93     75,469.45    170,393.38
Total              181,743.95    141,223.48    322,967.43

Total revenue in each city
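
The same idea as a PivotTable can be sketched in Python by grouping the quantitative values per (branch, gender) cell and then applying a measure. The `orders` records and the `pivot` helper are hypothetical illustrations. The figures are made up and unrelated to the revenue table above.

```python
import statistics
from collections import defaultdict

# Hypothetical (branch, gender, revenue) records.
orders = [
    ("Da Nang", "Female", 120.0),
    ("Da Nang", "Male", 80.5),
    ("Ha Noi", "Female", 200.0),
    ("Da Nang", "Female", 30.0),
    ("Ha Noi", "Male", 50.0),
]


def pivot(records, aggregate=sum):
    """Group revenue by (branch, gender) and apply an aggregate such as sum or mean."""
    groups = defaultdict(list)
    for branch, gender, revenue in records:
        groups[(branch, gender)].append(revenue)
    return {cell: aggregate(amounts) for cell, amounts in groups.items()}


totals = pivot(orders)                               # SUM per cell
averages = pivot(orders, aggregate=statistics.mean)  # AVERAGE per cell

print(totals)
print(averages)
```

Swapping the `aggregate` argument for `statistics.median` or `statistics.pstdev` gives the other measures listed above.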


III.3 Basic Multivariate Analyses
(2) Covariance & Correlation

Covariance: measures the relationship between two random variables and how they change together (or how they move relative to each other)

➤ When the Xs move away from the mean of X, how do the Ys move away from the mean of Y?
➤ 2 types: positive covariance (the variables move in the same direction) vs. negative covariance (they move in opposite directions)
➤ Range of covariance value: -∞ < Cov(x,y) < +∞
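
As a rough illustration, sample covariance can be computed directly from its definition. The helper name `covariance` and the three data series are hypothetical. The sketch divides by n - 1 (sample covariance), which is one common convention.

```python
def covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x) * (y - mean_y), divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical series.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]       # rises when ad_spend rises
complaints = [10, 8, 6, 4, 2]  # falls when ad_spend rises

print(covariance(ad_spend, sales))       # positive covariance
print(covariance(ad_spend, complaints))  # negative covariance
```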


Correlation:

➤ Covariance alone cannot show whether a relationship is "strong" or "weak", because its magnitude depends on the units of the variables
➤ Correlation: a normalized version of covariance
⇒ Measures both the strength and direction of the linear relationship
➤ Correlation coefficient (Pearson):
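
For reference, the Pearson coefficient of two samples is usually written as the covariance divided by the product of the two standard deviations. The notation below (r, s_x, s_y, and the bar for sample means) is assumed here and may differ from the notation used on the original slide.

```latex
r_{xy}
  = \frac{\operatorname{Cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```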


Correlation:

➤ The correlation coefficient is a ratio and has no unit
➤ Value: -1 ≤ corrcoef ≤ 1
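
A small experiment can make the "no unit" point concrete: changing the unit of one variable rescales the covariance but leaves the correlation coefficient unchanged. The data and the helper names `covariance` and `pearson` are hypothetical, chosen only for this sketch.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: the same prices expressed in dollars and in cents.
price_usd = [1.0, 2.0, 3.0, 4.0, 5.0]
price_cents = [p * 100 for p in price_usd]
units_sold = [52, 47, 45, 38, 30]

cov_usd = covariance(price_usd, units_sold)
cov_cents = covariance(price_cents, units_sold)
r_usd = pearson(price_usd, units_sold)
r_cents = pearson(price_cents, units_sold)

print("covariance:", cov_usd, "vs", cov_cents)  # differs by a factor of 100
print("correlation:", r_usd, "vs", r_cents)     # same value, between -1 and 1
```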


Strength of relationship:


Some cases with CC = 0:
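
One classic case, offered here as an assumed illustration rather than a reproduction of the slide's figure, is a symmetric curve such as y = x²: the variables are perfectly related, yet the linear correlation is zero. The `pearson` helper is a hypothetical name for a direct implementation of the Pearson formula.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient computed from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# y depends entirely on x, but the relationship is not linear.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(r)  # the positive and negative halves cancel out
```

A coefficient near zero therefore means "no linear relationship", not "no relationship", which is why a scatter plot should accompany the number.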


➤Use scatter plot


➤Calculate correlation coefficient

III Common Analyses
4. Regression

➤Find the trendline and regression formula
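
A trendline of the form y = slope · x + intercept can be found with ordinary least squares. The sketch below assumes a simple linear regression with one predictor. The `fit_line` and `predict` helpers and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted regression formula at x."""
    return slope * x + intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope} * x + {intercept}")
print(predict(6, slope, intercept))
```

In a spreadsheet the same formula is what a linear trendline displays. A negative slope means y tends to fall as x rises.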


➤Find the regression formula

Oops! It seems like customers with higher payments are less happy!

THANK YOU
