BAM2 Handout

Module
02
BUSINESS ANALYTICS FUNDAMENTAL
DATA | STATISTICS
Nov 2021
CONTENTS
01 Data Sources & Data Type
02 Data Preparation
03 Statistics Fundamental
01
DATA SOURCES &

DATA TYPE
DATA
Data is a recording of a fact.

It might be a number, a text, an image, an audio clip or a transcription, among other things. It simply states that something
happened
Data is a raw and unorganized fact that required to be interpreted, by a human or machine, to derive meaning.
So, data is meaningless
Information is a set of data which is processed in a meaningful way according to the given requirement.
Information is processed, structured, or presented in a given context to make it meaningful and useful.
DATA SOURCES
Government &
Internal data Market data Online platform
open data
Data from government or Data from online platform

Data of the market where
Data is captured by your public sources which is such as social media
your company operates or
organizational processes accessible to everyone and become more & more
is interested in
is often free to use important to business
• Economics and politics
• Transaction data
data
• Production data • Financial data
• Customer demand • Social media
• Customer profile • Population & Society
• Competitors • Google trends
• Employee data
• Market trend • Weather & natural • ….
• Sensor record conditions
• …
• … • …
DATA TYPE & SCALE
Data in
Analytics
Unstructured &
Structured data Semi structured
data
Categorical Numerical
Textual Multimedia Xml/Json
data data
Nominal Ordinal Interval Ratio Image Video Audio

STRUCTURED DATA
Nominal scale Ordinal scale Interval scale Ratio scale
• Nominal scale contain • Ordinal scale contain codes • Interval scale corresponds to a • Any variable for which the
measurements of simple codes assigned to objects or events variable in which the value is ratios can be computed and
assigned to objects as labels, as labels that also represent chosen from an interval set. In are meaningful is called ratio
which are not measurements. the rank order among them interval scale, there is not an scale. Most variables come
For example: the variable For example, Likert scale which absolute zero value. under this type; for example:
demand for a product, market
Gender can be generally can be found in many survey Variable such as temperature
share of a brand, sales, salary,
categorized as (1) Male, (2) data is Ordinal data. Assume
measured in centigrade (!C)
Female that a feedback is collected on and so on
score are examples of interval
a training program using 5-
scale
point Likert scale in which 1 =
Poor, 2 = Fair, 3 = Good, 4 =
Very Good, and 5 = Excellent
SEMI STRUCTURED DATA
XML
• Extensible Markup Language (XML) is a markup language that
defines a set of rules for encoding documents in a format that is
both human-readable and machine-readable.
• The design goals of XML focus on simplicity, generality, and

usability across the Internet. It’s originally designed to carry
data, not to display data
JSON
• JavaScript Object Notation (JSON) is a standard text-based
format for representing structured data.
• Json is a lightweight data-interchange format. It is commonly

used for transmitting data in web applications (e.g., sending
some data from the server to the client)
UNSTRUCTURED DATA
Text as data Image as data
Audio as data
02
DATA
PREPARATION
ANALYTICS READY
Data projects which overlook data related tasks often end up with wrong answer for the right problem.
Essentially, data has to be analytics ready, which means:
• Data has to comply with some basic usability & quality metrics
• Not all data is useful for all tasks, data has to match with the task for which it is intended to be used
• Data has to have certain structure in place with key fields/ variables with properly normalized values
• There must be an organization-wide agreed on definition for common variables and subject matters
REQUIREMENT FOR ANALYTICS READY
Data validity
Data
relevancy
Data source
Following are the most reliability
Data
prevailing metrics to ensure granularity
data quality & excellent
analytics readiness. Data content
accuracy
Specific application/ project ANALYTICS
Data READY
would require different levels of
currency/
emphasis paid on these metric data
timeliness Data
dimensions and may add more accessibility
specific ones
Data
consistency Data
Data security &
richness data privacy
Refer to Business Intelligence, Analytics, and Data Science A Managerial Perspective by Ramesh Sharda, Dursun Delen, Efraim Turban
REQUIREMENT FOR ANALYTICS READY (2)
Data sources reliability prefers to the originality and appropriateness of the source where data is obtained.
prefers to the accuracy and completeness of the data, given the uses they are intended for, and
Data content accuracy are not subject to inappropriate alteration.
means that the data easily and readily obtainable – answering the question of “Can we easily get
Data accessibility to the data when we need to”
Data security & means that the data is secure to only allow those people who have authority and the need to
data privacy access it and to prevent anyone else from reaching it
means that the available variables/ data fields are rich enough to portray the underlying subject
Data richness matter for an accurate and worthy analytics study
REQUIREMENT FOR ANALYTICS READY (3)
means that for a given variable in across datasets, the same data value represents the same
Data consistency meaning, and vice versa. Data consistency also require consistent format
means that the data should be recorded at or near the time when the events occur. Besides, the
Data currency/ data timeliness data need to be up-to-date (or as new as it needs to be) for a given analytics model.
requires that the variables and data values be defined at the lowest (or as low as required) level
Data granularity of detail for the intended use of the data.
describes a match/ mismatch between the actual and expected data values of a given data
Data validity variable
evaluates relevancy level of a data variable to the study being conducted. One important things
Data relevancy that all studies should avoid is to include much irrelevant data into the analytics as this may result
in inaccurate and misleading results
DATA INTEGRATION FRAMEWORK
DIF information architecture is

process of gathering data that
is scattered inside and outside
an enterprise and transforming
it into information that the
business uses to operate and
plan for the future
DATA INTEGRATION FRAMEWORK
BUSINESS
DATA
DATA SOURCES INTELLIGENCE &
WAREHOUSE ANALYTICS
DATA MART
Transaction Company
table dashboard
Dashboard for
Customer table Sales team
Dashboard for
Product table Marketing
Other
Other table dashboard
DATA PREPARATION
Data preparation is the core set of processes for data profiling/ data integration that gather data from diverse source systems, transform
it according to business and technical rules, and stage it for later steps in its life cycle when it becomes information used by business
consumers.
Data Data Data

Data cleaning transformation
gathering Consolidation
Store data
Data is stored to
MULTI DATA make it available for
SOURCES further processing in
the data
architecture.
• Eliminate
• Collect data • Reformat data • Discretize/
duplicates
• Select data • Consolidate and Aggregate data
• Reduce noise/
• Integrate data standardize data • Create
outlier attributes
• …
• Handle missing
• ….
value
• …
Data preparation processes are the lion’s share of the work of BI/ BA project—estimated at 60 to 75% of the project time.
Project delays and cost overruns are frequently tied to underestimating the amount of time and resources necessary to complete data
preparation or, even more frequently, to do the rework necessary when the project initially skimps on these activities and then data
consistency, accuracy, and quality issues arise
DATA PREPARATION
DATA CONSOLIDATION
Data in wrong format

which need to be
reformatted
18
DATA PREPARATION
DATA CONSOLIDATION
Data which are not

consistent
DATA PREPARATION
DATA CLEANING
Noise/ Outlier
Missing value
DATA PREPARATION
DATA TRANSFORMATION
Discrete data Aggregate

data
Customer segment Min Total spent Max total spent

Member - 1,000,000
Silver 1,000,001 3,000,000
Gold 3,000,001 5,000,000
Diamond 5,000,000 Unlimited
03
STATISTICS
FUNDAMENTAL
WHY STATISTICS (1)
Avg. liking score: 8.1 Avg. liking score: 7.9 -0.2 +0.2
7.7 7.9 8.1
7.9 8.1 8.3
-0.2 +0.2
Our product is better than competitor's ?
WHY STATISTICS (2)
Statistics is a collection of mathematical techniques to characterize and interpret data.

Statistical knowledge helps us use the proper methods to collect the data, employ the correct analyses, and effectively
present the results:
- Ascertain the accuracy of your measurement system
- Avoid bias in sample selection
- Determine whether an observed phenomenon is real, or whether it's the result of pure chance
- Detect relationship among variable. Determine whether there is an association between two or more variables
….
POPULATION & SAMPLE
POPULATION & SAMPLE
Population (also known as universal set) is the set of all possible data for a given context whereas
sample is the subset taken from a population.
• Usually, it is difficult (sometime practically impossible) to

collect data from population, thus we make inference about
the population based on the sample data.
• There are many challenges in sampling (process of selecting
an observation from the population). An incorrect sample may
result in bias and incorrect inference about the population.
• The science of statistics involves drawing conclusions or
inferring about the population based on the measurements
obtained from the sample.
• Before using any dataset, it’s crucial to clearly define the
Target audience 500 dataset is population or sample. Conclusions are made from
HCMC female 18 – 50y.o HCMC female 18 – 50y.o population can be applied for sample, but the opposite is not
necessarily true
POPULATION & SAMPLE
You work in a retail company & you have all transaction data (including
who buy which by when) collected from POS system
1. Customer database which you can access is sample or population?
2. Sales data which you can access is sample or population?
DESCRIPTIVE STATISTICS
DESCRIPTIVE STATISTICS
Descriptive statistics describes basic characteristics of the data at hand. Using formulas and numerical aggregations, descriptive
statistics summarizes the data, convert the numbers into meaningful & easily understandable representations.
Descriptive statistics is the heart of Descriptive analytics. However, it does not allow making conclusion beyond the sample of
the data being analyzed.
Measures of central Measures of

tendency variation Percentile
Standard
Mean Median Mode Range Variance Deviation Percentile
CASE STUDY
Hey, how much Store Revenue

is the revenue (Mil. VND) Onaverage
On average, ,
A fashion company which
of our stores Store 1 3,554 each store
owns 10 stores
last month? 2.5 bil
achieves 2.1
Store 2 359 VND
Store 3 4,470
Store 4 4,815
Store 5 587
Store 6 199
Store 7 282
Store 8 1,434
Store 9 4,818
Store 10 478
CENTRAL TENDENCY - MEAN
Mean is the arithmetical average value of the data.
Mean is one of the most frequently used measures of central tendency because:
- It represent the value that is most common in the dataset
- It includes every data in the dataset as part of the calculation thus it reflects the whole
dataset
Store Revenue
(Mil. VND)
- The mean is not often one of the actual values that you
Store 1 3,554
have observed in your dataset
Store 2 359
- There is a famous joke in statistics which says that, “if
Store 3 4,470
On average, someone’s head is in freezer and leg is in the oven, the
Store 4 4,815 each store
average body temperature would be fine, but the person may
achieves 2.1 bil
Store 5 587 VND not be alive” à Making decisions solely based on mean value
is not advisable
Store 6 199
Store 7 282 - It is affected significantly by presence of outliers
Store 8 1,434
Store 9 4,818
Store 10 478
- Mean is affected significantly by presence of outliers
Store Revenue Store Revenue

(Mil. VND) (Mil. VND)
Store 1 3,554 Store 1 35,540
Store 2 359 Store 2 359
Store 3 4,470 Store 3 4,470
Store 4 4,815 Store 4 4,815
Store 5 587 Mean = 2.1 Bil. VND Store 5 587 Mean = 5.3 Bil. VND
Store 8 1,434 Store 8 1,434
Store 9 4,818 Store 9 4,818

CENTRAL TENDENCY – MEDIAN (1)

Store 1 35,540 Store 6 199
Store 3 4,470 Store 2 359
Store 4 4,815 Store 10 478
Store 5 587 Store 5 587 Extract the value at midpoint of the dataset
Store 6 199 Store 8 1,434 =( 587 + 1,434)/2 = 1,010
Store 7 282 Store 3 4,470 Meaning:
• 50% of store have revenue above this number
Store 8 1,434 Store 4 4,815 • 50% of store have revenue below this number
Store 9 4,818 Store 9 4,818
Store 10 478 Store 1 35,540

Median is the middle score for a set of data that has been arranged in order of magnitude
Median = 56
14 35 45 55 55 56 56 65 87 89 92
14 35 45 55 55 56 56 65 87 89
Median = (55 + 56)/2 = 55.5

The median is less affected by outliers and skewed data,

Mean Median
0 5 10 15 20 25
thus it doesn’t reflect as well as the shape/ distribution of dataset as mean
Median = 56 Median = 56
0 20 40 60 80 100 0 20 40 60 80 100
14 35 45 55 55 56 56 65 87 89 92 1 5 45 55 55 56 56 65 87 89 100
CASE STUDY
Customer Gender Age Most of our

customers are
Customer 1 Male 25 female, from 25
Who are our
key customers? Customer 2 Female 40 – 30 y.o
Customer 3 Female 36
Customer 4 Male 27
10
28 5
Customer 8 Female
Customer 9 Male 30
Customer 10 Female 25 Male Female
8
Customer 13 Male 53 3
2
1 1
18 - 24 25 - 30 31 - 35 36 - 40 Above 40
Customer 15 Male 30
CENTRAL TENDENCY - MODE
Mode is the observation that occurs most frequently in the data set.
o Sometimes mode is not unique, so it leaves us with problems when we have two or more values that share the highest frequency
8 8
3
2
1
18 - 24 25 - 30 31 - 35 36 - 40 Above 40
o Mode will not provide us with a good measure of central tendency when the most common mark is far away from the rest of the
data in the data set
10
8
6
5
2
1 1
18 - 24 25 - 30 31 - 35 36 - 40 41 - 45 46 - 50 Above 51
CENTRAL TENDENCY - MODE
o Mode is useful for the datasets that contain a relatively small number of unique values. It may be useless if the datasets have too many
unique values
Customer Gender Age

Customer 1 Male 25
2 2 2
Customer 4 Male 27
Customer 5 Female 20 1 1 1 1 1 1 1 1 1
18 19 20 25 26 27 28 30 34 36 40 53
Customer 9 Male 30
8
Customer 13 Male 53 3
2
1 1
Customer 15 Male 30 18 - 24 25 - 30 31 - 35 36 - 40 Above 40
MEASURES OF VARIATION – RANGE
Range is the difference between maximum and minimum value of the data. It captures the data spread
In below dataset: Range = 102 – 2=100
2 3 5 9 21 93 99 99 99 102
MEASURES OF VARIATION – VARIANCE & STANDARD DEVIATION
ROSE DAISY
Store 1 3,554 Store 1 2,554
Store 2 359 Store 2 1,359
Store 3 4,470 Store 3 3,470
Store 4 4,815 Store 4 3,815
Store 5 587 Mean = 2.1 Bil. VND Store 5 1,087 Mean = 2.1 Bil. VND
Store 8 1,434 Store 8 934
Store 9 4,818 Store 9 3,818
Store 10 478 Store 10 1,478

Variance is a measure of variability in the data from the mean value. The larger the variance, the more the data
are spread out from the mean and the more variability we can observe in the data sample
s 2 denote Variance for population

µ denotes population mean
ROSE: Variance = 3.780.338 DAISY: Variance = 1.277.838

6.000
6.000
5.000
5.000
4.000
4.000
3.000
3.000
2.000
2.000
1.000
1.000
-
-
0 2 4 6 8 10 12
0 2 4 6 8 10 12
Standard deviation (s) is the square root of variance
ROSE: Variance = 3.780.338 à Standard deviation = 1.944
DAISY: Variance = 1.277.838 à Standard deviation = 1.130
v Population variance v Sample variance
v Population standard deviation v Sample standard deviation
(μ denotes population mean) ( denotes sample mean)

PERCENTILE
Percentile, denoted as Px , is the value of the data at which x percentage of the data lie below that value.
For example, P20 denotes the value below which 20% of the data lies
P20 = 14.4
Mil. VND P60 = 43.6
Bottom 20% SKU Mil. VND
Revenue (Mil. VND)
100
80
60
40
20
0
8
1
25
24
19
10
16
22
11
13
23
15
12
20
21
18
14
17
U
U
U
U
K
K
K
K
S
S
S
S
PERCENTILE
Decile corresponds to special values of percentile that divide the data into 10 equal parts. First decile contains first 10% of the data and
second decile contains first 20% of the data and so on.
Quartile divides the data into 4 equal parts. The first quartile (Q1 ) contains first 25% of the data, Q2 contains 50% of the data and is also
the median. Quartile 3 (Q3 ) accounts for 75% of the data
Q2
Q1 Q3
Bottom 25% SKU Revenue (Mil. VND)
100
80
60
40
20
0
8
1
25
24
19
10
16
22
11
13
23
15
12
20
21
18
14
17
U
U
U
U
K
K
K
K
S
S
S
S
Application
- Percentile & Quartile are used to detect outlier
- To identify the position of an observation in the group: product in top 20 best seller products, student in top 10…
APPLIED DESCRIPTIVE STATISTICS
Measurement Applicable for
Mean Ordinal(*), Interval & Ratio
Median Ordinal(*), Interval & Ratio
Mode Nominal, Ordinal
Range Ordinal, Interval & Ratio
Variance Ordinal(*), Interval & Ratio
Standard deviation Ordinal(*), Interval & Ratio
Percentile Ratio
(*) Use carefully

EXCEL FORMULA
Measurement Excel formula
Population mean
=average(number1, number2..)
Sample mean
Median =median(number1, number2..)
Mode =mode.sngl (number1, number2..)
Population variance =var.p (number1, number2..)
Sample variance =var.p (number1, number2..)
Population Standard deviation =stdev.p (number1, number2..)
Sample Standard deviation =stdev.s (number1, number2..)
=percentile.exc(array,x), or
Percentile Px
=percentile.inc(array,x)
DISTRIBUTION
DISTRIBUTION
A distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for
an experiment
Identify distribution of a sample is an important task, it allow us to discover hidden pattern of the population,
then predict value which are hidden or has not collected
DISTRIBUTION
Distribution of Age in Vietnam 2021

10,0%
{
9,0%
8,0%
8% if age=0-4
7.8% if age = 5-9
7,0% …
p= …
Frequency
6,0%
5,0%
0.03% if age = 100+
4,0%
3,0%
2,0%
1,0%
0,0%
0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
Age group
Source: Statista
DISTRIBUTION
DISTRIBUTION OF IQ SCORE IN THE WORLD

Frequency
$"%&& !
1 "(
%'
)!
𝑓 𝑥 = ∗𝑒 )
15 2π
IQ score Source: Wikipedia

COMMON DISTRIBUTION
Distribution of a dataset which includes Distribution of a dataset which includes

a finite or countably infinite set of values an infinite set of values
Discrete Continuous
distribution distributions
Normal
Binomial Poisson Geometric Uniform Exponential distribution/ Chi-square Student’s t- F-
distribution distribution distribution distribution distribution Gaussian distribution distribution distribution
distribution
THANK YOU
53

BAM2 Handout

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

BAM2 Handout

Uploaded by

Copyright:

Available Formats

Module

01 Data Sources & Data Type

DATA SOURCES &

Data is a recording of a fact.

Data from government or Data from online platform

Nominal Ordinal Interval Ratio Image Video Audio

Nominal scale Ordinal scale Interval scale Ratio scale

• The design goals of XML focus on simplicity, generality, and

• Json is a lightweight data-interchange format. It is commonly

DIF information architecture is

Data Data Data

Data in wrong format

Data which are not

Discrete data Aggregate

Customer segment Min Total spent Max total spent

7.7 7.9 8.1

7.9 8.1 8.3

Statistics is a collection of mathematical techniques to characterize and interpret data.

• Usually, it is difficult (sometime practically impossible) to

Measures of central Measures of

Hey, how much Store Revenue

Mean is the arithmetical average value of the data.

Store 7 282 - It is affected significantly by presence of outliers

- Mean is affected significantly by presence of outliers

Store Revenue Store Revenue

Store 3 4,470 Store 3 4,470

Store 4 4,815 Store 4 4,815

Store 7 282 Store 7 282

Store 8 1,434 Store 8 1,434

Store 9 4,818 Store 9 4,818

Store 10 478 Store 10 478

Store Revenue Store Revenue

Store 2 359 Store 7 282

Store 3 4,470 Store 2 359

Store 4 4,815 Store 10 478

Store 10 478 Store 1 35,540

Median = (55 + 56)/2 = 55.5

The median is less affected by outliers and skewed data,

thus it doesn’t reflect as well as the shape/ distribution of dataset as mean

Customer Gender Age Most of our

Customer Gender Age

Store 2 359 Store 2 1,359

Store 3 4,470 Store 3 3,470

Store 4 4,815 Store 4 3,815

Store 7 282 Store 7 1,282

Store 8 1,434 Store 8 934

Store 9 4,818 Store 9 3,818

Store 10 478 Store 10 1,478

s 2 denote Variance for population

ROSE: Variance = 3.780.338 DAISY: Variance = 1.277.838

Standard deviation (s) is the square root of variance

ROSE: Variance = 3.780.338 à Standard deviation = 1.944

DAISY: Variance = 1.277.838 à Standard deviation = 1.130

v Population variance v Sample variance

v Population standard deviation v Sample standard deviation

(μ denotes population mean) ( denotes sample mean)

- Percentile & Quartile are used to detect outlier

Measurement Applicable for

Mean Ordinal(*), Interval & Ratio

Median Ordinal(*), Interval & Ratio

Mode Nominal, Ordinal

Range Ordinal, Interval & Ratio

Variance Ordinal(*), Interval & Ratio