
Module 02
BUSINESS ANALYTICS FUNDAMENTAL
DATA | STATISTICS
Nov 2021
CONTENTS

01 Data Sources & Data Type

02 Data Preparation

03 Statistics Fundamental
01 DATA SOURCES & DATA TYPE
DATA

Data is a recording of a fact.

It might be a number, a text, an image, an audio clip or a transcription, among other things. It simply states that something happened.

Data is a raw and unorganized fact that must be interpreted, by a human or a machine, to derive meaning. On its own, data is meaningless.

Information is a set of data that has been processed in a meaningful way according to a given requirement.
Information is processed, structured, or presented in a given context to make it meaningful and useful.
DATA SOURCES

Internal data
Data captured by your organizational processes
• Transaction data
• Production data
• Customer profile
• Employee data
• Sensor record
• …

Market data
Data of the market where your company operates or is interested in
• Customer demand
• Competitors
• Market trend
• …

Government & open data
Data from government or public sources, which is accessible to everyone and is often free to use
• Economics and politics data
• Financial data
• Population & Society
• Weather & natural conditions
• …

Online platform
Data from online platforms such as social media, which is becoming more and more important to business
• Social media
• Google trends
• …
DATA TYPE & SCALE

Data in Analytics
• Structured data
  o Categorical data: Nominal, Ordinal
  o Numerical data: Interval, Ratio
• Unstructured & semi-structured data
  o Textual data
  o Multimedia data: Image, Video, Audio
  o XML/JSON


STRUCTURED DATA

Nominal scale
• A nominal scale contains simple codes assigned to objects as labels, which are not measurements.
For example: the variable Gender can generally be categorized as (1) Male, (2) Female.

Ordinal scale
• An ordinal scale contains codes assigned to objects or events as labels that also represent the rank order among them.
For example, the Likert scale, which can be found in many survey datasets, is ordinal data. Assume that feedback is collected on a training program using a 5-point Likert scale in which 1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, and 5 = Excellent.

Interval scale
• An interval scale corresponds to a variable whose value is chosen from an interval set. In an interval scale, there is no absolute zero value.
Variables such as temperature measured in centigrade (°C) or scores are examples of the interval scale.

Ratio scale
• Any variable for which ratios can be computed and are meaningful is on a ratio scale. Most variables come under this type; for example: demand for a product, market share of a brand, sales, salary, and so on.
SEMI STRUCTURED DATA

XML
• Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

• The design goals of XML focus on simplicity, generality, and usability across the Internet. It was originally designed to carry data, not to display data.

JSON
• JavaScript Object Notation (JSON) is a standard text-based format for representing structured data.

• JSON is a lightweight data-interchange format. It is commonly used for transmitting data in web applications (e.g., sending some data from the server to the client).
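To make the two formats concrete, here is a minimal Python sketch that reads the same record from a JSON string and from an XML string using only the standard library. The customer record and its field names are hypothetical, chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical customer record, encoded once as JSON and once as XML.
json_text = '{"customer": "Customer 1", "gender": "Male", "age": 25}'
xml_text = '<customer name="Customer 1"><gender>Male</gender><age>25</age></customer>'

# JSON maps directly onto Python dicts, lists, strings and numbers.
record = json.loads(json_text)
print(record["gender"], record["age"])  # Male 25

# XML is parsed into a tree of elements; text content is always a string,
# so the age has to be converted to an integer explicitly.
root = ET.fromstring(xml_text)
xml_record = {
    "customer": root.attrib["name"],
    "gender": root.findtext("gender"),
    "age": int(root.findtext("age")),
}

# Both encodings carry the same structured record.
print(xml_record == record)  # True
```

Both formats are semi-structured: the field names travel with the data, so no fixed table schema is needed to read them.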
UNSTRUCTURED DATA
Text as data Image as data

Audio as data
02 DATA PREPARATION
ANALYTICS READY

Data projects which overlook data-related tasks often end up with the wrong answer for the right problem.
Essentially, data has to be analytics ready, which means:

• Data has to comply with some basic usability & quality metrics

• Not all data is useful for all tasks; data has to match the task for which it is intended to be used

• Data has to have a certain structure in place, with key fields/ variables with properly normalized values

• There must be an organization-wide agreed-on definition for common variables and subject matters
REQUIREMENT FOR ANALYTICS READY

The following are the most prevalent metrics to ensure data quality and excellent analytics readiness. A specific application or project would require different levels of emphasis on these metric dimensions and may add more specific ones.

• Data source reliability
• Data content accuracy
• Data accessibility
• Data security & data privacy
• Data richness
• Data consistency
• Data currency/ data timeliness
• Data granularity
• Data validity
• Data relevancy

Refer to Business Intelligence, Analytics, and Data Science A Managerial Perspective by Ramesh Sharda, Dursun Delen, Efraim Turban
REQUIREMENT FOR ANALYTICS READY (2)

Data source reliability: refers to the originality and appropriateness of the source where the data is obtained.

Data content accuracy: refers to the accuracy and completeness of the data, given the uses they are intended for, and means that the data are not subject to inappropriate alteration.

Data accessibility: means that the data is easily and readily obtainable, answering the question "Can we easily get to the data when we need to?"

Data security & data privacy: means that the data is secured so that only those people who have the authority and the need to access it can do so, and anyone else is prevented from reaching it.

Data richness: means that the available variables/ data fields are rich enough to portray the underlying subject matter for an accurate and worthy analytics study.

REQUIREMENT FOR ANALYTICS READY (3)

Data consistency: means that for a given variable across datasets, the same data value represents the same meaning, and vice versa. Data consistency also requires a consistent format.

Data currency/ data timeliness: means that the data should be recorded at or near the time when the events occur. Besides, the data needs to be up-to-date (or as new as it needs to be) for a given analytics model.

Data granularity: requires that the variables and data values be defined at the lowest (or as low as required) level of detail for the intended use of the data.

Data validity: describes a match/ mismatch between the actual and expected data values of a given data variable.

Data relevancy: evaluates the relevancy level of a data variable to the study being conducted. One important thing that all studies should avoid is including too much irrelevant data in the analytics, as this may lead to inaccurate and misleading results.

DATA INTEGRATION FRAMEWORK

The data integration framework (DIF) information architecture is the process of gathering data that is scattered inside and outside an enterprise and transforming it into information that the business uses to operate and plan for the future.
DATA INTEGRATION FRAMEWORK

DATA SOURCES → DATA WAREHOUSE → DATA MART → BUSINESS INTELLIGENCE & ANALYTICS

Tables: Transaction table, Customer table, Product table, Other table
Dashboards: Company dashboard, Dashboard for Sales team, Dashboard for Marketing, Other dashboard
DATA PREPARATION

Data preparation is the core set of processes for data profiling/ data integration that gather data from diverse source systems, transform
it according to business and technical rules, and stage it for later steps in its life cycle when it becomes information used by business
consumers.

MULTI DATA SOURCES → Data gathering → Data consolidation → Data cleaning → Data transformation → Store data

Data gathering
• Collect data
• Select data
• Integrate data
• …

Data consolidation
• Reformat data
• Consolidate and standardize data
• …

Data cleaning
• Eliminate duplicates
• Reduce noise/ outlier
• Handle missing value
• …

Data transformation
• Discretize/ Aggregate data
• Create attributes
• …

Store data
Data is stored to make it available for further processing in the data architecture.

Data preparation processes are the lion's share of the work of a BI/ BA project, estimated at 60 to 75% of the project time.
Project delays and cost overruns are frequently tied to underestimating the amount of time and resources necessary to complete data preparation or, even more frequently, to doing the rework necessary when the project initially skimps on these activities and data consistency, accuracy, and quality issues then arise.
DATA PREPARATION
DATA CONSOLIDATION

Data in the wrong format, which needs to be reformatted

DATA PREPARATION
DATA CONSOLIDATION

Data which are not consistent
DATA PREPARATION
DATA CLEANING

Noise/ Outlier

Missing value
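
The cleaning steps named in this module (eliminate duplicates, handle missing values, reduce noise/ outliers) can be sketched in a few lines of Python. The records, the plausible age range of 0 to 120, and the choice to fill with the median are assumptions made for illustration; a real project would pick rules that fit its own data.

```python
from statistics import median

# Hypothetical raw records: (customer_id, age); None marks a missing value.
raw = [("C1", 25), ("C2", None), ("C2", None), ("C3", 36), ("C4", 27), ("C5", 250)]

# 1. Eliminate exact duplicates while keeping the original order.
deduped = list(dict.fromkeys(raw))

# 2. Pick a fill value: the median of the ages that are present and plausible.
#    The 0..120 range is an assumed business rule, not a universal one.
plausible = [age for _, age in deduped if age is not None and 0 <= age <= 120]
fill_value = median(plausible)

# 3. Replace missing values and noisy (implausible) values with the fill value.
cleaned = [
    (cid, age if age is not None and 0 <= age <= 120 else fill_value)
    for cid, age in deduped
]

print(cleaned)
# [('C1', 25), ('C2', 27), ('C3', 36), ('C4', 27), ('C5', 27)]
```

Filling with the median is only one option; dropping the record or flagging it for manual review are equally valid choices depending on the analysis.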
DATA PREPARATION
DATA TRANSFORMATION

Discrete data
Aggregate data

Customer segment    Min total spent    Max total spent
Member              -                  1,000,000
Silver              1,000,001          3,000,000
Gold                3,000,001          5,000,000
Diamond             5,000,001          Unlimited
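
As a sketch of how such a segment table turns a numerical variable into a categorical one, the Python function below maps a customer's total spend to a segment and then aggregates customers per segment. It assumes the upper bounds are inclusive and that Diamond starts just above 5,000,000; the customer IDs and amounts are hypothetical.

```python
from collections import Counter

# Inclusive upper bound of each segment, in the same unit as the table above.
SEGMENT_UPPER_BOUNDS = [
    ("Member", 1_000_000),
    ("Silver", 3_000_000),
    ("Gold", 5_000_000),
]


def customer_segment(total_spent: int) -> str:
    """Discretize total spend into a customer segment."""
    for name, upper_bound in SEGMENT_UPPER_BOUNDS:
        if total_spent <= upper_bound:
            return name
    return "Diamond"  # anything above the last bound


# Hypothetical total spend per customer.
total_spent = {"C1": 750_000, "C2": 2_400_000, "C3": 5_000_000, "C4": 9_900_000, "C5": 120_000}

# Discretize each customer, then aggregate: number of customers per segment.
segments = {cid: customer_segment(amount) for cid, amount in total_spent.items()}
customers_per_segment = Counter(segments.values())

print(segments)
print(customers_per_segment)
```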
03 STATISTICS FUNDAMENTAL
WHY STATISTICS (1)

Avg. liking score: 8.1 (±0.2, i.e. 7.9 to 8.3)
Avg. liking score: 7.9 (±0.2, i.e. 7.7 to 8.1)

Is our product better than the competitor's?
WHY STATISTICS (2)

Statistics is a collection of mathematical techniques to characterize and interpret data.

Statistical knowledge helps us use the proper methods to collect the data, employ the correct analyses, and effectively present the results:
- Ascertain the accuracy of your measurement system
- Avoid bias in sample selection
- Determine whether an observed phenomenon is real, or whether it's the result of pure chance
- Detect relationships among variables: determine whether there is an association between two or more variables
- …
POPULATION & SAMPLE

Population (also known as the universal set) is the set of all possible data for a given context, whereas a sample is a subset taken from a population.

Example: the population is the target audience (HCMC females, 18 – 50 y.o.); the sample is 500 HCMC females, 18 – 50 y.o.

• Usually, it is difficult (sometimes practically impossible) to collect data from the whole population, thus we make inferences about the population based on the sample data.
• There are many challenges in sampling (the process of selecting observations from the population). An incorrect sample may result in bias and incorrect inference about the population.
• The science of statistics involves drawing conclusions or inferring about the population based on the measurements obtained from the sample.
• Before using any dataset, it's crucial to clearly define whether the dataset is a population or a sample. Conclusions made from a population can be applied to a sample, but the opposite is not necessarily true.
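
A small simulation can illustrate the relationship between a population and a sample. The sketch below builds a hypothetical population of ages between 18 and 50, draws a simple random sample of 500, and compares the two means. The population size, the uniform age distribution and the seed are assumptions for illustration only.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 100,000 people aged 18 to 50.
population = [random.randint(18, 50) for _ in range(100_000)]

# Simple random sample of 500 people, drawn without replacement.
sample = random.sample(population, 500)

population_mean = mean(population)
sample_mean = mean(sample)

# The sample mean is an estimate of the population mean; it is close, not equal.
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

Simple random sampling is only one sampling design; a biased selection rule (for example, only surveying customers of one store) would not give an estimate that can be trusted for the whole population.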
POPULATION & SAMPLE

You work in a retail company and you have all transaction data (including who buys what and when) collected from the POS system.
1. Is the customer database which you can access a sample or a population?
2. Is the sales data which you can access a sample or a population?
DESCRIPTIVE STATISTICS

Descriptive statistics describes the basic characteristics of the data at hand. Using formulas and numerical aggregations, descriptive statistics summarizes the data and converts the numbers into meaningful and easily understandable representations.
Descriptive statistics is the heart of descriptive analytics. However, it does not allow making conclusions beyond the sample of data being analyzed.

Measures of central tendency: Mean, Median, Mode
Measures of variation: Range, Variance, Standard deviation
Percentile: Percentile
CASE STUDY

A fashion company which owns 10 stores

Question: "Hey, how much is the revenue of our stores last month?"
Answer: "On average, each store achieves 2.1 bil VND"

Store       Revenue (Mil. VND)
Store 1     3,554
Store 2     359
Store 3     4,470
Store 4     4,815
Store 5     587
Store 6     199
Store 7     282
Store 8     1,434
Store 9     4,818
Store 10    478
CENTRAL TENDENCY - MEAN

Mean is the arithmetic average of the data.

Mean is one of the most frequently used measures of central tendency because:
- It represents the typical (central) value of the dataset
- It includes every value in the dataset as part of the calculation, thus it reflects the whole dataset
CENTRAL TENDENCY - MEAN

- The mean is often not one of the actual values that you have observed in your dataset
- There is a famous joke in statistics which says that "if someone's head is in the freezer and their leg is in the oven, the average body temperature would be fine, but the person may not be alive" → Making decisions solely based on the mean value is not advisable
- It is affected significantly by the presence of outliers

Store       Revenue (Mil. VND)
Store 1     3,554
Store 2     359
Store 3     4,470
Store 4     4,815
Store 5     587
Store 6     199
Store 7     282
Store 8     1,434
Store 9     4,818
Store 10    478

On average, each store achieves 2.1 bil VND
CENTRAL TENDENCY - MEAN

- Mean is affected significantly by the presence of outliers

Store       Revenue (Mil. VND)    Revenue with outlier (Mil. VND)
Store 1     3,554                 35,540
Store 2     359                   359
Store 3     4,470                 4,470
Store 4     4,815                 4,815
Store 5     587                   587
Store 6     199                   199
Store 7     282                   282
Store 8     1,434                 1,434
Store 9     4,818                 4,818
Store 10    478                   478
Mean        2.1 Bil. VND          5.3 Bil. VND
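
The two means above can be reproduced with Python's standard `statistics` module. This is a minimal sketch using the store revenues from the tables; only the variable names are invented here.

```python
from statistics import mean

# Store revenue in Mil. VND, as listed in the tables above.
revenue = [3554, 359, 4470, 4815, 587, 199, 282, 1434, 4818, 478]

# Same stores, but Store 1 is recorded as 35,540 instead of 3,554.
revenue_with_outlier = [35540] + revenue[1:]

print(mean(revenue))               # 2099.6 Mil. VND, about 2.1 Bil. VND
print(mean(revenue_with_outlier))  # 5298.2 Mil. VND, about 5.3 Bil. VND
```

A single extreme value moves the mean from about 2.1 to about 5.3 Bil. VND, even though nine of the ten stores are unchanged.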


CENTRAL TENDENCY – MEDIAN (1)

Original order                    Sorted by revenue
Store       Revenue (Mil. VND)    Store       Revenue (Mil. VND)
Store 1     35,540                Store 6     199
Store 2     359                   Store 7     282
Store 3     4,470                 Store 2     359
Store 4     4,815                 Store 10    478
Store 5     587                   Store 5     587
Store 6     199                   Store 8     1,434
Store 7     282                   Store 3     4,470
Store 8     1,434                 Store 4     4,815
Store 9     4,818                 Store 9     4,818
Store 10    478                   Store 1     35,540

Extract the value at the midpoint of the sorted dataset = (587 + 1,434)/2 = 1,010.5
Meaning:
• 50% of stores have revenue above this number
• 50% of stores have revenue below this number


CENTRAL TENDENCY – MEDIAN (2)

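Median is the middle score for a set of data that has been arranged in order of magnitude.

Odd number of values (11): the median is the middle value
14 35 45 55 55 56 56 65 87 89 92
Median = 56

Even number of values (10): the median is the average of the two middle values
14 35 45 55 55 56 56 65 87 89
Median = (55 + 56)/2 = 55.5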
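
The odd and even cases above can be checked with a short Python sketch. `statistics.median` sorts the data and averages the two middle values when the count is even; the variable names are chosen here for illustration.

```python
from statistics import median

# 11 values: the median is the single middle value.
odd_count = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]

# 10 values: the median is the average of the two middle values.
even_count = odd_count[:-1]

print(median(odd_count))   # 56
print(median(even_count))  # 55.5
```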


CENTRAL TENDENCY – MEDIAN (3)

The median is less affected by outliers and skewed data, thus it does not reflect the shape/ distribution of the dataset as well as the mean does.

[Figure: a skewed distribution on a 0 to 25 axis, with the positions of the mean and the median marked]

Both datasets below have the same median although their extreme values differ:
14 35 45 55 55 56 56 65 87 89 92     Median = 56
1 5 45 55 55 56 56 65 87 89 100      Median = 56
CASE STUDY

Question: "Who are our key customers?"
Answer: "Most of our customers are female, from 25 – 30 y.o"

Customer       Gender    Age
Customer 1     Male      25
Customer 2     Female    40
Customer 3     Female    36
Customer 4     Male      27
Customer 5     Female    20
Customer 6     Female    26
Customer 7     Female    19
Customer 8     Female    28
Customer 9     Male      30
Customer 10    Female    25
Customer 11    Female    18
Customer 12    Female    34
Customer 13    Male      53
Customer 14    Female    27
Customer 15    Male      30

[Chart: customers by gender: Male 5, Female 10]
[Chart: customers by age group: 18 - 24: 3, 25 - 30: 8, 31 - 35: 1, 36 - 40: 2, Above 40: 1]
CENTRAL TENDENCY - MODE

Mode is the observation that occurs most frequently in the data set.
o Sometimes the mode is not unique, which leaves us with a problem when two or more values share the highest frequency

[Chart: customer count by age group (18 - 24, 25 - 30, 31 - 35, 36 - 40, Above 40) with bar values 8, 8, 3, 2 and 1, so two groups tie for the highest count]

o Mode will not provide us with a good measure of central tendency when the most common value is far away from the rest of the data in the data set

[Chart: customer count by age group (18 - 24, 25 - 30, 31 - 35, 36 - 40, 41 - 45, 46 - 50, Above 51) with bar values 10, 8, 6, 5, 2, 1 and 1]
CENTRAL TENDENCY - MODE

o Mode is useful for datasets that contain a relatively small number of unique values. It may be useless if the dataset has too many unique values

Customer       Gender    Age
Customer 1     Male      25
Customer 2     Female    40
Customer 3     Female    36
Customer 4     Male      27
Customer 5     Female    20
Customer 6     Female    26
Customer 7     Female    19
Customer 8     Female    28
Customer 9     Male      30
Customer 10    Female    25
Customer 11    Female    18
Customer 12    Female    34
Customer 13    Male      53
Customer 14    Female    27
Customer 15    Male      30

[Chart: count by exact age: 18: 1, 19: 1, 20: 1, 25: 2, 26: 1, 27: 2, 28: 1, 30: 2, 34: 1, 36: 1, 40: 1, 53: 1]
[Chart: count by age group: 18 - 24: 3, 25 - 30: 8, 31 - 35: 1, 36 - 40: 2, Above 40: 1]
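
The contrast between the two charts can be reproduced in Python: on raw ages the mode is a three-way tie, while grouping ages into bins gives one clear modal group. The ages come from the customer table; the `age_group` helper is a hypothetical name that uses the bin edges shown in the chart.

```python
from collections import Counter
from statistics import multimode

# Ages of the 15 customers in the table above.
ages = [25, 40, 36, 27, 20, 26, 19, 28, 30, 25, 18, 34, 53, 27, 30]


def age_group(age: int) -> str:
    """Map an age to the age groups used in the chart."""
    if age <= 24:
        return "18 - 24"
    if age <= 30:
        return "25 - 30"
    if age <= 35:
        return "31 - 35"
    if age <= 40:
        return "36 - 40"
    return "Above 40"


# Raw ages: three values share the highest frequency (2), so the mode is not unique.
print(multimode(ages))  # [25, 27, 30]

# Grouped ages: one group clearly dominates.
group_counts = Counter(age_group(age) for age in ages)
print(group_counts.most_common(1))  # [('25 - 30', 8)]
```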
MEASURES OF VARIATION – RANGE

Range is the difference between the maximum and minimum values of the data. It captures the spread of the data.
In the dataset below: Range = 102 – 2 = 100

2 3 5 9 21 93 99 99 99 102
MEASURES OF VARIATION – VARIANCE & STANDARD DEVIATION

Store       ROSE revenue (Mil. VND)    DAISY revenue (Mil. VND)
Store 1     3,554                      2,554
Store 2     359                        1,359
Store 3     4,470                      3,470
Store 4     4,815                      3,815
Store 5     587                        1,087
Store 6     199                        1,199
Store 7     282                        1,282
Store 8     1,434                      934
Store 9     4,818                      3,818
Store 10    478                        1,478
Mean        2.1 Bil. VND               2.1 Bil. VND


MEASURES OF VARIATION – VARIANCE & STANDARD DEVIATION

Variance is a measure of the variability of the data around the mean value. The larger the variance, the more the data are spread out from the mean and the more variability we can observe in the data sample.

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

σ² denotes the variance of a population of N values
µ denotes the population mean

ROSE: Variance = 3,780,338     DAISY: Variance = 1,277,838

[Charts: store revenue (Mil. VND, 0 to 6,000) plotted per store, one panel for ROSE and one for DAISY]
MEASURES OF VARIATION – VARIANCE & STANDARD DEVIATION

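Standard deviation (σ) is the square root of the variance

ROSE: Variance = 3,780,338 → Standard deviation = 1,944

DAISY: Variance = 1,277,838 → Standard deviation = 1,130

• Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
• Population standard deviation: $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$
• Sample standard deviation: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

(μ denotes the population mean, $\bar{x}$ denotes the sample mean)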
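
The ROSE and DAISY figures can be checked with Python's `statistics` module, which provides separate functions for the population (`pvariance`, `pstdev`) and sample (`variance`, `stdev`) versions. The sketch below treats the ten stores of each brand as the whole population, which is what the quoted variances correspond to.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Store revenue in Mil. VND for the two brands, from the table above.
rose = [3554, 359, 4470, 4815, 587, 199, 282, 1434, 4818, 478]
daisy = [2554, 1359, 3470, 3815, 1087, 1199, 1282, 934, 3818, 1478]

for name, revenue in (("ROSE", rose), ("DAISY", daisy)):
    print(
        name,
        f"mean={mean(revenue):,.1f}",
        f"population variance={pvariance(revenue):,.0f}",
        f"population std dev={pstdev(revenue):,.0f}",
    )

# The sample versions divide by n - 1 instead of N, so they are slightly larger.
print(variance(rose) > pvariance(rose), stdev(rose) > pstdev(rose))  # True True
```

Both brands have the same mean (2,099.6 Mil. VND), but ROSE's revenue is spread much more widely around it than DAISY's.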


PERCENTILE

Percentile, denoted as Px, is the value of the data below which x percent of the data lie.
For example, P20 denotes the value below which 20% of the data lies.

[Chart: revenue (Mil. VND, 0 to 100) per SKU, with P20 = 14.4 Mil. VND (bottom 20% of SKUs) and P60 = 43.6 Mil. VND marked]
PERCENTILE

Deciles are special values of percentile that divide the data into 10 equal parts. The first decile contains the first 10% of the data, the second decile contains the first 20% of the data, and so on.

Quartiles divide the data into 4 equal parts. The first quartile (Q1) contains the first 25% of the data, Q2 contains the first 50% of the data and is also the median. The third quartile (Q3) accounts for the first 75% of the data.

[Chart: revenue (Mil. VND, 0 to 100) per SKU, with Q1 (bottom 25% of SKUs), Q2 and Q3 marked]
Application

- Percentiles & quartiles are used to detect outliers

- To identify the position of an observation in the group: a product in the top 20 best-selling products, a student in the top 10…
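
One common way to use quartiles for outlier detection is the 1.5 × IQR rule, sketched below in Python. The slides do not say which rule they have in mind, so the rule itself is an assumption; the data is the store revenue with Store 1 recorded as 35,540. `statistics.quantiles` with `method="inclusive"` treats the data as the whole population.

```python
from statistics import quantiles

# Store revenue in Mil. VND, with Store 1 recorded as 35,540.
revenue = [35540, 359, 4470, 4815, 587, 199, 282, 1434, 4818, 478]

# Q1, Q2 (the median) and Q3 split the sorted data into four equal parts.
q1, q2, q3 = quantiles(revenue, n=4, method="inclusive")

# Assumed rule: values further than 1.5 x IQR outside Q1..Q3 are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in revenue if x < lower_fence or x > upper_fence]

print(q1, q2, q3)  # 388.75 1010.5 4728.75
print(outliers)    # [35540]
```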
APPLIED DESCRIPTIVE STATISTICS

Measurement Applicable for

Mean Ordinal(*), Interval & Ratio

Median Ordinal(*), Interval & Ratio

Mode Nominal, Ordinal

Range Ordinal, Interval & Ratio

Variance Ordinal(*), Interval & Ratio

Standard deviation Ordinal(*), Interval & Ratio

Percentile Ratio

(*) Use carefully


EXCEL FORMULA

Measurement                      Excel formula
Population mean / Sample mean    =average(number1, number2..)
Median                           =median(number1, number2..)
Mode                             =mode.sngl(number1, number2..)
Population variance              =var.p(number1, number2..)
Sample variance                  =var.s(number1, number2..)
Population standard deviation    =stdev.p(number1, number2..)
Sample standard deviation        =stdev.s(number1, number2..)
Percentile Px                    =percentile.exc(array,x), or =percentile.inc(array,x)
DISTRIBUTION

A distribution is the mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment.

Identifying the distribution of a sample is an important task: it allows us to discover hidden patterns of the population, then predict values which are hidden or have not been collected.
DISTRIBUTION

Distribution of Age in Vietnam 2021

[Chart: frequency (0% to 10%) by age group: 0-4, 5-9, 10-14, …, 95-99, 100+]

$$p = \begin{cases} 8\% & \text{if age} = 0\text{-}4 \\ 7.8\% & \text{if age} = 5\text{-}9 \\ \dots \\ 0.03\% & \text{if age} = 100+ \end{cases}$$

Source: Statista
DISTRIBUTION

DISTRIBUTION OF IQ SCORE IN THE WORLD

[Chart: frequency by IQ score]

$$f(x) = \frac{1}{15\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - 100}{15}\right)^2}$$

Source: Wikipedia
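
The IQ curve is a normal (Gaussian) density. As a sketch, the Python code below writes that density out by hand, assuming the conventional IQ scaling of mean 100 and standard deviation 15, and compares it with `statistics.NormalDist` from the standard library. The function name `iq_density` is chosen here for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

MU, SIGMA = 100, 15  # assumed IQ scaling: mean 100, standard deviation 15


def iq_density(x: float) -> float:
    """Normal density written out term by term."""
    return exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * sqrt(2 * pi))


iq = NormalDist(mu=MU, sigma=SIGMA)

# The hand-written formula and the library density agree.
print(abs(iq_density(110) - iq.pdf(110)) < 1e-12)  # True

# Share of scores within one standard deviation of the mean (85 to 115).
share_85_to_115 = iq.cdf(115) - iq.cdf(85)
print(f"{share_85_to_115:.1%}")  # 68.3%
```

Once the distribution is identified, questions such as "what share of people score between 85 and 115?" can be answered from the function alone, without access to the raw data.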


COMMON DISTRIBUTION

Discrete distribution: the distribution of a dataset which includes a finite or countably infinite set of values
• Binomial distribution
• Poisson distribution
• Geometric distribution

Continuous distributions: the distribution of a dataset which includes an uncountably infinite set of values (for example, any value within an interval)
• Uniform distribution
• Exponential distribution
• Normal distribution/ Gaussian distribution
• Chi-square distribution
• Student's t-distribution
• F-distribution
THANK YOU
