BUSINESS ANALYTICS FUNDAMENTALS
DATA | STATISTICS
Nov 2021
CONTENTS
01 Data
02 Data Preparation
03 Statistics Fundamentals

01
DATA
Data is raw, unorganized facts that must be interpreted, by a human or machine, to derive meaning.
On its own, data is meaningless.
Information is a set of data processed in a meaningful way according to a given requirement.
Information is data that has been processed, structured, or presented in a given context to make it meaningful and useful.
DATA SOURCES
• Internal data
• Market data
• Online platforms
• Government & open data

DATA IN ANALYTICS
• Structured data: categorical, numerical
• Unstructured & semi-structured data: textual data, multimedia data, XML/JSON
• Nominal scale contains codes assigned to objects as labels, which are not measurements. For example, the variable Gender can generally be categorized as (1) Male, (2) Female.
• Ordinal scale contains codes assigned to objects or events as labels that also represent the rank order among them. For example, the Likert scale found in many survey data is ordinal: assume feedback on a training program is collected on a 5-point Likert scale in which 1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, and 5 = Excellent.
• Interval scale corresponds to a variable whose value is chosen from an interval set. In an interval scale there is no absolute zero value. Variables such as temperature measured in centigrade (°C) are examples of interval scale.
• Any variable for which ratios can be computed and are meaningful is called ratio scale. Most variables come under this type; for example: demand for a product, market share of a brand, sales, salary, and so on.
SEMI STRUCTURED DATA
XML
• Extensible Markup Language (XML) is a markup language that
defines a set of rules for encoding documents in a format that is
both human-readable and machine-readable.
JSON
• JavaScript Object Notation (JSON) is a standard text-based
format for representing structured data.
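As a minimal sketch of why JSON counts as semi-structured data, the snippet below parses a hypothetical customer record (the field names are illustrative, not from the deck) with Python's standard library:

```python
import json

# A hypothetical customer record as semi-structured JSON text.
raw = '{"customer_id": 11, "gender": "Female", "age": 18, "tags": ["retail", "new"]}'

# json.loads turns the text into Python objects (dict/list/str/int),
# making the record machine-processable without a fixed table schema.
record = json.loads(raw)
print(record["age"])        # field access by key
print(len(record["tags"]))  # nested structures are preserved
```

The same record could carry extra or missing fields per row, which is exactly what distinguishes semi-structured from structured (tabular) data.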
Audio as data
02
DATA
PREPARATION
ANALYTICS READY
Data projects that overlook data-related tasks often end up with the wrong answer to the right problem.
Essentially, data has to be analytics ready, which means:
• Data has to comply with some basic usability & quality metrics
• Not all data is useful for all tasks; data has to match the task for which it is intended to be used
• Data has to have a certain structure in place, with key fields/ variables holding properly normalized values
• There must be organization-wide, agreed-upon definitions for common variables and subject matters
REQUIREMENT FOR ANALYTICS READY
Data validity
Data
relevancy
Data source
Following are the most reliability
Data
prevailing metrics to ensure granularity
data quality & excellent
analytics readiness. Data content
accuracy
Specific application/ project ANALYTICS
Data READY
would require different levels of
currency/
emphasis paid on these metric data
timeliness Data
dimensions and may add more accessibility
specific ones
Data
consistency Data
Data security &
richness data privacy
Refer to Business Intelligence, Analytics, and Data Science A Managerial Perspective by Ramesh Sharda, Dursun Delen, Efraim Turban
REQUIREMENT FOR ANALYTICS READY (2)
• Data source reliability refers to the originality and appropriateness of the source from which the data is obtained.
• Data content accuracy refers to the accuracy and completeness of the data, given the uses they are intended for, and that they are not subject to inappropriate alteration.
• Data accessibility means that the data is easily and readily obtainable – answering the question "Can we easily get to the data when we need to?"
• Data security & data privacy mean that the data is secured so that only those people who have the authority and the need can access it, and anyone else is prevented from reaching it.
• Data richness means that the available variables/ data fields are rich enough to portray the underlying subject matter for an accurate and worthy analytics study.
REQUIREMENT FOR ANALYTICS READY (3)
• Data consistency means that, for a given variable across datasets, the same data value represents the same meaning, and vice versa. Data consistency also requires a consistent format.
• Data currency/ data timeliness means that the data should be recorded at or near the time the events occur. Besides, the data needs to be up-to-date (or as new as it needs to be) for a given analytics model.
• Data granularity requires that the variables and data values be defined at the lowest (or as low as required) level of detail for the intended use of the data.
• Data validity describes a match/ mismatch between the actual and expected values of a given data variable.
• Data relevancy evaluates how relevant a data variable is to the study being conducted. One important thing all studies should avoid is including much irrelevant data in the analytics, as this may lead to inaccurate and misleading results.
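As a minimal sketch, two of the metrics above (data validity and data consistency) can be checked programmatically. The field names and rules below are hypothetical, not from the deck:

```python
# Toy rows with one validity issue and one consistency issue.
rows = [
    {"store": "Store 1", "revenue": 3554},
    {"store": "Store 2", "revenue": 359},
    {"store": "Store 2", "revenue": 359},  # exact duplicate -> consistency issue
    {"store": "Store 6", "revenue": -5},   # negative revenue -> validity issue
]

def validity_issues(rows):
    """Data validity: actual values must match expectations (revenue >= 0)."""
    return [r for r in rows if r["revenue"] < 0]

def duplicate_count(rows):
    """Data consistency: the same record should not appear twice."""
    seen, dups = set(), 0
    for r in rows:
        key = (r["store"], r["revenue"])
        if key in seen:
            dups += 1
        seen.add(key)
    return dups

print(len(validity_issues(rows)))  # 1
print(duplicate_count(rows))       # 1
```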
DATA INTEGRATION FRAMEWORK

DATA SOURCES → DATA WAREHOUSE → DATA MART → BUSINESS INTELLIGENCE & ANALYTICS

• Data sources: transaction table, customer table, product table, other tables
• Business intelligence & analytics: company dashboard, dashboard for Sales team, dashboard for Marketing, other dashboards
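The source-to-mart flow can be sketched with an in-memory SQLite database: raw tables are loaded, then a pre-aggregated "mart" view feeds a sales dashboard. Table and column names are hypothetical, not from the deck:

```python
import sqlite3

# Simulated warehouse: raw source tables loaded into one store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
con.execute("CREATE TABLE txn (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "An"), (2, "Binh")])
con.executemany("INSERT INTO txn VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 70.0)])

# The "data mart": a pre-aggregated view serving the sales dashboard.
con.execute("""
    CREATE VIEW sales_mart AS
    SELECT c.name, SUM(t.amount) AS total
    FROM txn t JOIN customer c ON c.id = t.customer_id
    GROUP BY c.name
""")
for name, total in con.execute("SELECT name, total FROM sales_mart ORDER BY name"):
    print(name, total)  # An 150.0, then Binh 70.0
```

Dashboards then query the mart view instead of the raw tables, which is the point of the layered framework above.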
DATA PREPARATION
Data preparation is the core set of processes for data profiling/ data integration that gather data from diverse source systems, transform
it according to business and technical rules, and stage it for later steps in its life cycle when it becomes information used by business
consumers.
MULTI DATA SOURCES →
• Collect data, select data, integrate data, …
• Reformat data, consolidate and standardize data, …
• Eliminate duplicates, reduce noise/ outliers, handle missing values, …
• Discretize/ aggregate data, create attributes, …
→ Store data: data is stored to make it available for further processing in the data architecture.
Data preparation processes are the lion's share of the work of a BI/ BA project, estimated at 60 to 75% of the project time.
Project delays and cost overruns are frequently tied to underestimating the time and resources needed to complete data preparation or, even more frequently, to the rework required when the project initially skimps on these activities and data consistency, accuracy, and quality issues then arise.
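Three of the cleaning steps named above (eliminating duplicates, handling missing values, aggregating) can be sketched on toy rows; the field names and the mean-imputation rule are illustrative assumptions:

```python
rows = [
    {"store": "Store 1", "revenue": 3554},
    {"store": "Store 1", "revenue": 3554},  # duplicate
    {"store": "Store 2", "revenue": None},  # missing value
    {"store": "Store 3", "revenue": 4470},
]

# 1) Eliminate exact duplicates while preserving order.
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2) Handle missing values: impute with the mean of the known revenues.
known = [r["revenue"] for r in deduped if r["revenue"] is not None]
mean = sum(known) / len(known)
for r in deduped:
    if r["revenue"] is None:
        r["revenue"] = mean

# 3) Aggregate: total revenue across the cleaned rows.
total = sum(r["revenue"] for r in deduped)
print(len(deduped), total)  # 3 rows remain; total 12036.0
```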
DATA PREPARATION – DATA CONSOLIDATION
• Noise/ outlier
• Missing value

DATA PREPARATION – DATA TRANSFORMATION
03
STATISTICS FUNDAMENTALS
WHY STATISTICS (1)
[Chart: average liking score of our product, 8.1 ± 0.2, vs. the competitor's, 7.9 ± 0.2]
Is our product better than the competitor's?
WHY STATISTICS (2)
Population (also known as the universal set) is the set of all possible data for a given context, whereas a sample is a subset taken from a population.
You work in a retail company and you have all transaction data (including who bought what and when) collected from the POS system.
1. Is the customer database you can access a sample or a population?
2. Is the sales data you can access a sample or a population?
DESCRIPTIVE STATISTICS
Descriptive statistics describes the basic characteristics of the data at hand. Using formulas and numerical aggregations, descriptive statistics summarizes the data, converting the numbers into meaningful and easily understandable representations.
Descriptive statistics is the heart of descriptive analytics. However, it does not allow making conclusions beyond the sample being analyzed.
Mean | Median | Mode | Range | Variance | Standard Deviation | Percentile
CASE STUDY

Store | Revenue (Mil. VND)
Store 3 | 4,470
Store 4 | 4,815
Store 5 | 587
Store 6 | 199
Store 7 | 282
Store 8 | 1,434
Store 9 | 4,818
Store 10 | 478
CENTRAL TENDENCY - MEAN
Mean is one of the most frequently used measures of central tendency because:
- It represents a typical, central value of the dataset
- It includes every data point in the dataset in its calculation, so it reflects the whole dataset
CENTRAL TENDENCY - MEAN

Store | Revenue (Mil. VND)
Store 1 | 3,554
Store 2 | 359
Store 3 | 4,470
Store 4 | 4,815
Store 5 | 587
Store 6 | 199
Store 7 | 282
Store 8 | 1,434
Store 9 | 4,818
Store 10 | 478

On average, each store achieves 2.1 Bil. VND.

- The mean is often not one of the actual values observed in your dataset.
- There is a famous joke in statistics: "if someone's head is in the freezer and their legs are in the oven, the average body temperature would be fine, but the person may not be alive" → making decisions solely based on the mean value is not advisable.
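The slide's mean can be reproduced with the standard library, using the ten store revenues from the case study:

```python
from statistics import mean

# Store revenues from the case-study slide (Mil. VND).
revenues = [3554, 359, 4470, 4815, 587, 199, 282, 1434, 4818, 478]

avg = mean(revenues)
print(avg)  # 2099.6 Mil. VND, i.e. roughly the slide's 2.1 Bil. VND
```

Note that 2,099.6 is not any store's actual revenue, illustrating the first caveat above.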
CENTRAL TENDENCY - MEAN
[Chart: with a few extreme store revenues changed, the mean jumps from 2.1 Bil. VND to 5.3 Bil. VND]

CENTRAL TENDENCY - MEDIAN
Sort the dataset and extract the value at its midpoint:
Median = (587 + 1,434)/2 = 1,010.5 Mil. VND
Meaning:
• 50% of stores have revenue above this number
• 50% of stores have revenue below this number
Median is the middle score of a dataset that has been arranged in order of magnitude.

• Odd number of values: 14 35 45 55 55 56 56 65 87 89 92 → Median = 56
• Even number of values: 14 35 45 55 55 56 56 65 87 89 → Median = (55 + 56)/2 = 55.5
• The median is robust to extreme values: 14 35 45 55 55 56 56 65 87 89 92 and 1 5 45 55 55 56 56 65 87 89 100 both have Median = 56
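The three cases above can be checked directly with `statistics.median`:

```python
from statistics import median

# Odd count: the middle value.
odd = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]
print(median(odd))      # 56

# Even count: the average of the two middle values.
even = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89]
print(median(even))     # 55.5

# Robust to extremes: replacing the ends with 1 and 100 leaves it unchanged.
extreme = [1, 5, 45, 55, 55, 56, 56, 65, 87, 89, 100]
print(median(extreme))  # 56
```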
CASE STUDY

Customer | Gender | Age
Customer 11 | Female | 18
Customer 12 | Female | 34
Customer 13 | Male | 53
Customer 14 | Female | 27
Customer 15 | Male | 30

[Histogram of age groups: 18-24: 8, 25-30: 3, 31-35: 2, 36-40: 1, Above 40: 1]
CENTRAL TENDENCY - MODE
Mode is the observation that occurs most frequently in the data set.
o Sometimes the mode is not unique, which leaves us with problems when two or more values share the highest frequency:
[Histogram of age groups: 18-24: 8, 25-30: 8, 31-35: 3, 36-40: 2, Above 40: 1 – two groups tie]
o Mode will not provide a good measure of central tendency when the most common value is far away from the rest of the data:
[Histogram of age groups: 18-24: 10, 25-30: 8, 31-35: 6, 36-40: 5, 41-45: 2, 46-50: 1, Above 51: 1]
CENTRAL TENDENCY - MODE
o Mode is useful for datasets that contain a relatively small number of unique values. It may be useless if the dataset has too many unique values.
Ages observed: 18 19 20 25 26 27 28 30 34 36 40 53

Customer | Gender | Age
Customer 6 | Female | 26
Customer 7 | Female | 19
Customer 8 | Female | 28
Customer 9 | Male | 30
Customer 10 | Female | 25
Customer 11 | Female | 18
Customer 12 | Female | 34
Customer 13 | Male | 53
Customer 14 | Female | 27
Customer 15 | Male | 30

[Histogram of age groups: 18-24: 8, 25-30: 3, 31-35: 2, 36-40: 1, Above 40: 1]
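Both behaviors (a unique mode and a tie) can be shown with `statistics.multimode`, using the ages of Customers 6-15 from the slide:

```python
from statistics import multimode

# Ages of Customers 6-15 from the slide.
ages = [26, 19, 28, 30, 25, 18, 34, 53, 27, 30]
print(multimode(ages))    # [30] -- the only repeated age

# When two values tie for the highest frequency, the mode is not unique.
scores = [1, 2, 2, 3, 3]
print(multimode(scores))  # [2, 3]
```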
MEASURES OF VARIATION – RANGE
Range is the difference between the maximum and minimum values of the data. It captures the data spread.
In the dataset below: Range = 102 – 2 = 100
2 3 5 9 21 93 99 99 99 102
MEASURES OF VARIATION – VARIANCE & STANDARD DEVIATION

ROSE:  Store 1: 3,554 | Store 5: 587 | Store 6: 199 (Mil. VND), Mean = 2.1 Bil. VND
DAISY: Store 1: 2,554 | Store 5: 1,087 | Store 6: 1,199 (Mil. VND), Mean = 2.1 Bil. VND

Variance is a measure of variability in the data from the mean value. The larger the variance, the more the data are spread out from the mean and the more variability we can observe in the data sample.

[Scatter plots: ROSE's revenues are widely spread around the mean; DAISY's cluster close to it]
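A minimal sketch of the ROSE vs. DAISY comparison, using only the three store revenues shown on the slide (so the computed means here differ from the slide's 2.1 Bil. figure, which covers all stores):

```python
from statistics import pstdev, pvariance

# The three store revenues shown per chain (Mil. VND).
rose = [3554, 587, 199]
daisy = [2554, 1087, 1199]

# Similar totals can hide very different spreads: ROSE's revenues are far
# more dispersed than DAISY's, so its variance and standard deviation are larger.
print(pvariance(rose), pstdev(rose))
print(pvariance(daisy), pstdev(daisy))
```

The standard deviation is just the square root of the variance, expressed in the same unit as the data (Mil. VND), which makes it easier to interpret.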
MEASURES OF VARIATION – PERCENTILE
Percentile, denoted Px, is the value of the data below which x percent of the data lie.
For example, P20 denotes the value below which 20% of the data lies.

[Bar chart of SKU revenue (Mil. VND), sorted ascending: the bottom 20% of SKUs fall below P20 = 14.4 Mil. VND; P60 = 43.6 Mil. VND]
PERCENTILE
Decile corresponds to special values of percentile that divide the data into 10 equal parts. The first decile contains the first 10% of the data, the second decile contains the first 20% of the data, and so on.
Quartile divides the data into 4 equal parts. The first quartile (Q1) contains the first 25% of the data; Q2 contains 50% of the data and is also the median; the third quartile (Q3) accounts for 75% of the data.

[Bar chart of SKU revenue (Mil. VND), sorted ascending, with Q1, Q2, and Q3 marked; the bottom 25% of SKUs fall below Q1]

Application
- To identify the position of an observation in its group: a product in the top 20 best sellers, a student in the top 10, …
APPLIED DESCRIPTIVE STATISTICS

Excel formulas:
• Population mean / sample mean: =AVERAGE(number1, number2, ...)
• Percentile Px: =PERCENTILE.EXC(array, x) or =PERCENTILE.INC(array, x)
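The same two percentile conventions exist in Python's standard library: `statistics.quantiles` with `n=100` returns the 99 percentile cut points, where `method='exclusive'` mirrors Excel's PERCENTILE.EXC and `method='inclusive'` mirrors PERCENTILE.INC. The data below is an illustrative 1-to-100 series, not the deck's SKU figures:

```python
from statistics import quantiles

data = list(range(1, 101))  # 1..100

# Index 19 of the 99 cut points is the 20th percentile, P20.
p20_exc = quantiles(data, n=100, method='exclusive')[19]
p20_inc = quantiles(data, n=100, method='inclusive')[19]
print(p20_exc, p20_inc)  # 20.2 and 20.8
```

The two methods interpolate slightly differently near the tails, which is why Excel offers both functions.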
DISTRIBUTION
A distribution is the mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment.
Identifying the distribution of a sample is an important task: it allows us to discover hidden patterns of the population and then predict values that are unobserved or not yet collected.
DISTRIBUTION

An empirical distribution of population by age group:

p = 8.0%  if age = 0-4
    7.8%  if age = 5-9
    …
    0.03% if age = 100+

[Histogram: frequency by age group, from 0-4 through 100+]
Source: Statista
DISTRIBUTION

The normal density, illustrated on the slide with σ = 15:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

• Discrete distributions: Binomial, Poisson, Geometric
• Continuous distributions: Uniform, Exponential, Normal/ Gaussian, Chi-square, Student's t, F
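A minimal sketch of the idea that a distribution describes a population: draw a large sample from a normal (Gaussian) distribution and check that the sample's descriptive statistics approach the distribution's parameters. The values of mu and sigma are illustrative assumptions:

```python
import random
from statistics import mean, stdev

random.seed(0)  # reproducible draws
mu, sigma = 0.0, 15.0

# 100,000 draws from N(mu, sigma).
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

# The sample mean and standard deviation converge toward mu and sigma.
print(round(mean(sample), 1), round(stdev(sample), 1))
```

This is the bridge back to descriptive statistics: once a sample's distribution is identified, its parameters summarize the whole population, not just the observed data.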
THANK YOU