SB 2024 Lecture3

Session 3: Descriptive Statistics
Statistics for Business

Dr. Le Anh Tuan
Numerical Description
Three key characteristics of numerical data:
► Center: the extent to which the values of a numerical variable
group around a typical, or central, value.
► Where are the data values concentrated?
► What seem to be typical or middle data values? Is there
central tendency?
► Variability: the amount of dispersion, or scattering, away from a
central value that the values of a numerical variable show
► How much dispersion is there in the data?
► How spread out are the data values?
► Are there unusual values?
► Shape: the pattern of the distribution of values from the lowest
value to the highest value.
► Are the data values distributed symmetrically? Skewed?
Sharply peaked? Flat? Bimodal?
Measures of Central Tendency
► Central tendency is a single value used to describe the center
point of a data set.
► Measures of central tendency: Mean (including the Weighted Mean), Median, and Mode.
Mean
► The mean, or average, is the most common measure of
central tendency.
► In statistics, we generally use the term mean instead of average, and the mean has a specific formula:
► Calculate the mean by adding all the values in a data set and then dividing the result by the number of observations.
Mean
► For the population:
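$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
where $x_1, \dots, x_N$ are the population values and $N$ is the population size.
► For the sample:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $x_1, \dots, x_n$ are the observed values and $n$ is the sample size.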
Mean
► Suppose a sample of size n = 10 gives the following values:
4 3 8 10 0 2 3 7 5 2
► The sample mean is
$$\bar{x} = \frac{4 + 3 + 8 + 10 + 0 + 2 + 3 + 7 + 5 + 2}{10} = \frac{44}{10} = 4.4$$
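The sample-mean calculation can be checked with a few lines of Python. This is only a minimal sketch; the function name `sample_mean` is an illustrative choice, not something from the lecture.

```python
# Sample values from the slide (n = 10)
values = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]


def sample_mean(xs):
    """Add all the values and divide by the number of observations."""
    if not xs:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(xs) / len(xs)


mean_value = sample_mean(values)
print(mean_value)  # 4.4
```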
Mean
► Advantages:
► Easy to calculate
► Summarizes the data with a single value
► Disadvantages:
► Affected by outliers (values that are much higher or
lower than most of the data)
Mean
► Affected by outliers (values that are much higher or lower than
most of the data)
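► Without an outlier (values 1, 2, 3, 4, 5):
$$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \quad \text{(Mean = 3)}$$
► With an outlier (values 1, 2, 3, 4, 10):
$$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4 \quad \text{(Mean = 4)}$$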
Weighted Mean
► A weighted mean allows you to assign more weight to
certain values and less weight to others.
► Formula for the weighted mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
► $x_i$ = the ith data value
► $w_i$ = the weight for each data value $x_i$
► $\sum_{i=1}^{n} w_i$ = the sum of all the weights.
Weighted Mean
► A GPA is a weighted average that assigns greater weight to
courses with more credits. Grades typically range from 0 to 4.
For example, if student A earned a grade of 4.0 in Science (worth 4
credits), 3.0 in English (worth 3 credits), and 3.5 in Physics
(worth 2 credits), what is A's GPA?
► Solution:
► Multiply each grade by its weight, add the products,
and then divide by the sum of the weights.
$$\bar{x} = \frac{4 \times 4 + 3 \times 3 + 3.5 \times 2}{4 + 3 + 2} = \frac{32}{9} \approx 3.56$$
► A’s grade point average is approximately 3.56.
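The GPA example can be reproduced with a short Python sketch of the weighted-mean formula. The helper name `weighted_mean` and the variable names are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Grades and credits from the GPA example: Science, English, Physics
grades = [4.0, 3.0, 3.5]
credit_hours = [4, 3, 2]

gpa = weighted_mean(grades, credit_hours)
print(round(gpa, 2))  # 3.56
```

When every weight is the same, the weighted mean reduces to the ordinary mean.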
Questions
► You invest in the stock market, where returns are strongly
influenced by market conditions. The details are as follows:
Stock returns
Market condition    Likelihood    Returns
Up                  45%           20%
Neutral             20%           14%
Down                35%           -5%
1. Calculate the expected return for this investment.
2. If you invest US$1,000, calculate the expected monetary
value.
The Median
Median
► The median is the value in the data set for which half of the
observations are higher and half of the observations are
lower.
► In other words, the median (M) is the 50th percentile, or
midpoint, of the ordered sample data.
Median
► Not affected by outliers
[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]
Median
► The location of the median:
$$\text{Median position} = \frac{n + 1}{2}$$
position in the ordered data, where n is the number of observations.
► If the number of observations is odd, the median
is the middle number.
► If the number of observations is even, the median
is the average of the two middle numbers.
► Note that (n+1)/2 is not the value of the median, only the
position of the median in the ranked data.
Median
► Suppose an ordered sample of size n = 9 gives the following
values:
0 3 4 7 9 12 13 17 25
► The median position is
$$\frac{9 + 1}{2} = 5$$
► The median value is 9 (the 5th value in the ordered data).
Median
► Suppose an ordered sample of size n = 10 gives the following
values:
0 3 4 7 9 12 13 17 25 4000
► The median position is
$$\frac{10 + 1}{2} = 5.5$$
► The median value is the average of the 5th and 6th values:
$$\frac{9 + 12}{2} = 10.5$$
► The median is not affected by the outlier (4000).
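The position rule for the median can be sketched in Python as follows. The function name `median_by_position` is an illustrative assumption; the two samples are the ones used in the examples above, and the last line shows how strongly the same outlier pulls the mean.

```python
def median_by_position(xs):
    """Median located via the (n + 1) / 2 position in the ordered data."""
    if not xs:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number, so take that value (1-based)
        position = (n + 1) // 2
        return ordered[position - 1]
    # Even n: (n + 1) / 2 ends in .5, so average the two middle values
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25]
even_sample = odd_sample + [4000]

print(median_by_position(odd_sample))       # 9
print(median_by_position(even_sample))      # 10.5
print(sum(even_sample) / len(even_sample))  # 409.0 (the mean, pulled up by 4000)
```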
The Mode
Mode
►The mode is the value that appears most often in a data set.
►If no data value or category appears more than once,
we say that the mode does not exist.
►More than one mode can exist if two or more values
tie for the most frequent.
►The mode is most useful for discrete or categorical data
with only a few distinct data values. For continuous data or
data with a wide range, the mode is rarely useful.
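The three cases above (no mode, one mode, several modes) can be sketched in Python. The function name `modes` and the small data sets are hypothetical, chosen only to illustrate each case.

```python
from collections import Counter


def modes(xs):
    """Return all most-frequent values; an empty list means no mode exists."""
    if not xs:
        return []
    counts = Counter(xs)
    highest = max(counts.values())
    if highest == 1:
        # No value appears more than once, so the mode does not exist
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 3, 4]))                              # []
print(modes([2, 9, 9, 5]))                              # [9]
print(modes(["red", "blue", "red", "blue", "green"]))   # ['red', 'blue']
```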
Mode
[Figure: dot plot on a 0 to 6 axis with no repeated value: No Mode]
[Figure: dot plot on a 0 to 14 axis: Mode = 9]
Shape of a Distribution
►Describes how data are distributed
►Measures of shape
►Symmetric
►Skewed (Left or Right)
►Left-Skewed: Mean < Median
►Symmetric: Mean = Median
►Right-Skewed: Median < Mean
Skewness
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
►Skewness < 0: left-skewed
►Skewness = 0: symmetric
►Skewness > 0: right-skewed
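The sign of the sample skewness statistic can be explored with a short Python sketch. It assumes the adjusted formula n / ((n − 1)(n − 2)) × Σ((xᵢ − x̄) / s)³ with s the sample standard deviation; the function name and the three small data sets are hypothetical illustrations.

```python
import statistics


def sample_skewness(xs):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)**3)."""
    n = len(xs)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample standard deviation (divides by n - 1)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]   # long tail to the right
left_skewed = [1, 7, 8, 9, 10]    # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(sample_skewness(data), 3))
```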
Questions
►The number of students in 10 classes was recorded. The
stem-and-leaf diagram, with the tens digit as the stem, shows
this information:
1. What percentage of classes have more than 20 students?
2. How many classes have fewer than 5 students?
3. Find the mean, median, and mode.
4. Determine the shape of the distribution.
5. A class of 40 students is added to the sample. The teacher
says the median will remain the same. Is the teacher correct?
Please explain your answer.
Questions
►The dot plot provides information on the number of cities
each student in a class has visited.
1. How many observations are there in the sample?
2. How many students have visited more than 10 cities?
3. Find the mean, median, and mode.
4. Determine the shape of the distribution.
Measures of Variability
►Measures of variation: Range, Variance, Standard deviation
►Measures of variation give information on the spread or
variability of the data values.
►Variability provides a quantitative measure of the differences
between scores in a distribution and describes the degree to which
the scores are spread out or clustered together.
[Figure: two distributions with the same center but different variation]
Range
►Simplest measure of variation
►Difference between the largest and the smallest observations.
Range = Highest Value – Lowest Value
►Example (dot plot on a 0 to 14 axis): Range = 14 - 1 = 13
Range
►Advantages
►Easy to calculate and understand
►Disadvantages
►Based on only two numbers in the data set; ignores
the way in which the data are distributed
►Sensitive to outliers
Variance
►The variance is the average of the squared differences
between each data value and the mean.
►Sample variance is denoted by $s^2$:
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$
►Where $\bar{x}$ = sample mean
►n = sample size
►$X_i$ = ith value of the variable X
Variance
►Population variance is denoted by $\sigma^2$:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
►Where $\mu$ = population mean
►N = population size
►$X_i$ = ith value of the variable X
Variance
►A sample includes 8 observations:
10 12 14 15 17 18 18 24
n = 8, Mean $\bar{x}$ = 16
►Variance:
$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} = 18.57$$
Standard Deviations
►Standard deviation is the square root of the variance.
►Has the same units as the original data, making it more easily
interpreted than the variance.
►Sample standard deviation:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}}$$
Standard deviation
►A sample includes 8 observations:
10 12 14 15 17 18 18 24
n = 8, Mean $\bar{x}$ = 16
►Sample variance:
$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} = 18.57$$
►Standard deviation: $s = \sqrt{18.57} = 4.31$
►A measure of how far, on average, each data value is
from the mean of the sample.
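The variance and standard deviation of the 8-observation sample can be verified step by step in Python. This is a minimal sketch; the variable names are illustrative, and the standard-library `statistics` module is used only as a cross-check.

```python
import math
import statistics

sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
mean = sum(sample) / n                                # 16.0
squared_deviations = [(x - mean) ** 2 for x in sample]
sample_variance = sum(squared_deviations) / (n - 1)   # 130 / 7
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 2))  # 18.57
print(round(sample_sd, 2))        # 4.31

# Cross-check against the standard library (both divide by n - 1)
assert math.isclose(sample_variance, statistics.variance(sample))
assert math.isclose(sample_sd, statistics.stdev(sample))
```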
Comparing SD
Questions
► Scores for the TOEIC test are presented below for a group of n =
12 students.
Gender    Score
Male      500
Male      450
Male      600
Male      520
Male      540
Male      390
Female    600
Female    700
Female    720
Female    300
Female    410
Female    270
1. Compute the variance for the entire group.
2. Which group (male or female) has the higher dispersion in scores?
Excel and STATA
► Excel Formulas
► Mean: =AVERAGE(Data Values)
► Weighted Mean: =SUMPRODUCT(Data Values, Weights)/SUM(Weights)
► Median: =MEDIAN(Data Values)
► Mode: =MODE(Data Values)
► Variance: =VAR.S(Data Values)
► SD: =STDEV.S(Data Values)
Excel
► Excel Tools: Data/Data Analysis/Descriptive Statistics
MegaStat
Grouped Data
► Suppose data are grouped into K classes, with frequencies f1, f2,
. . ., fK, and the midpoints of the classes are m1, m2, . . ., mK.
► For a sample of n observations, the mean is
$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$$
n: the total number of observations ($n = \sum_{i=1}^{K} f_i$)
K: the number of classes
► This mean is only an approximate value since the midpoint is just
an estimate of the value in each class.
Grouped Data
► The following table gives the frequency distribution of the
number of orders received each morning during the past 30
mornings at a coffee store. Calculate the mean.
Number of orders    Frequency (f)
0 - 4               4
5 - 9               5
10 - 14             10
15 - 19             7
20 - 24             4
                    n = 30
Grouped Data
► For a sample of n observations, the variance is
$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$$
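The grouped-data calculations can be sketched in Python for the coffee-store table. This is only an illustration under stated assumptions: each class is represented by its midpoint, and the variance uses the usual grouped-sample form Σ fᵢ(mᵢ − x̄)² / (n − 1). The variable names are hypothetical.

```python
# (lower limit, upper limit, frequency) for each class in the orders table
classes = [(0, 4, 4), (5, 9, 5), (10, 14, 10), (15, 19, 7), (20, 24, 4)]

midpoints = [(low + high) / 2 for low, high, _ in classes]   # 2, 7, 12, 17, 22
frequencies = [freq for _, _, freq in classes]
n = sum(frequencies)                                         # 30

# Approximate mean: each observation is treated as if it sat at its midpoint
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Approximate sample variance (assumed n - 1 denominator)
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

print(round(grouped_mean, 2))      # 12.33
print(round(grouped_variance, 2))  # 37.82
```

Both results are approximations, because the true values inside each class are unknown.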
Using the Mean and Standard Deviation Together
Coefficient of Variation
►The standard deviation is affected by the scale of the data.
►When sample means are very different, comparing
SDs can be misleading.
►The coefficient of variation, CV, expresses the SD as a
percentage of the mean.
►A high CV indicates high variability relative to the
size of the mean.
►A low CV indicates low variability relative to the size
of the mean.
►A smaller coefficient of variation indicates more
consistency within a set of data values.
Coefficient of Variation
►Sample coefficient of variation
$$CV = \frac{s}{\bar{x}} \times 100$$
s = the sample standard deviation
$\bar{x}$ = the sample mean
►Population coefficient of variation
$$CV = \frac{\sigma}{\mu} \times 100$$
$\sigma$ = the population standard deviation
$\mu$ = the population mean
Coefficient of Variation example
►Stock market
        Price for Stock A    Price for Stock B
Mean    100                  60
SD      20                   15
CV      = 20/100 × 100       = 15/60 × 100
        = 20%                = 25%
►Although Stock A has a larger standard deviation, its price is more
consistent relative to its mean (lower CV).
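The stock comparison can be reproduced with a small Python sketch of the CV formula. The function name `coefficient_of_variation` is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=20, mean=100)
cv_b = coefficient_of_variation(sd=15, mean=60)

print(cv_a)  # 20.0
print(cv_b)  # 25.0

# Stock A has the larger SD but the smaller CV, so it is relatively more consistent
print("A" if cv_a < cv_b else "B")  # A
```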
Z-score
►A z-score identifies the number of standard deviations a
particular value is from the mean of its distribution.
►A z-score has no unit.
►A z-score is
►Zero for values equal to the mean
►Positive for values above the mean
►Negative for values below the mean
Z-score formula
►Sample z-score
$$z = \frac{x - \bar{x}}{s}$$
x = the data value of interest
s = the sample standard deviation
$\bar{x}$ = the sample mean
►Population z-score
$$z = \frac{x - \mu}{\sigma}$$
$\sigma$ = the population standard deviation
$\mu$ = the population mean
Unusual observations
►Based on its standardized z-score, a data value is classified as:
►Unusual if |z| > 2 (beyond μ ± 2σ)
►Outlier if |z| > 3 (beyond μ ± 3σ)
Z-score example
►Price for a glass of milk tea (size L) in Vietnam
15K 28K 43K 50K 70K 90K
►Average price: Mean = 49.33K
►SD = 27.40K
►How far is the price of KOI THÉ (90K) from the sample
mean of 49.33K (in SD increments)?
Z-score example
►Price for a glass of milk tea (size L) in Vietnam
15K 28K 43K 50K 70K 90K
$$z = \frac{90 - 49.33}{27.40} = 1.48$$
►The price of KOI is more than one standard deviation (1.48)
above the sample mean.
Empirical rule
►If the data distribution is a bell-shaped, symmetrical curve
centered around the mean, we would expect:
►About 68% of the values in the population to fall within ±1
standard deviation from the mean (μ ± 1σ)
►About 95% of the values in the population to fall within ±2
standard deviations from the mean (μ ± 2σ)
►About 99.7% of the values in the population to fall within ±3
standard deviations from the mean (μ ± 3σ)
Chebychev’s Theorem
► For any population with mean μ and standard deviation σ, and k >
1, the percentage of observations that fall within the interval
$$[\mu \pm k\sigma]$$
► is at least
$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$
► Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the
values will fall within k standard deviations of the mean (for k > 1).
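A short Python sketch makes the Chebyshev bound concrete for a few values of k. The function name `chebyshev_lower_bound` is an illustrative assumption.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return (1 - 1 / k ** 2) * 100


for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 1))
# 1.5 55.6
# 2 75.0
# 3 88.9
```

These bounds hold for any distribution, which is why they are weaker than the 95% and 99.7% figures that apply only to bell-shaped data.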
Relative Position
Percentiles, Quartiles, and Box Plots
Measures of relative position
► Measures of relative position compare the position of one
value in relation to other values in the data set.
► Measures:
► Percentiles
► Quartiles
► Interquartile range
Percentiles
► Percentiles divide the ordered data into 100 groups.
► For example, your EQ score is in the 80th percentile on a
standardized test. That means that 80% of the test-takers
scored below you.
► Generally, the pth percentile of a data set (where p is any
number between 1 and 100) is the value such that at least p
percent of the observations fall at or below it.
Percentiles
► Arrange the data in ascending order.
► Compute $L_p$, the location of the pth percentile:
$$L_p = \frac{p}{100}(n + 1)$$
Percentiles
► Example: The 80th percentile for the starting salary data
► Step 1. Arrange the data in ascending order.
5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050,
6130, 6325.
► Step 2. $L_{80} = \frac{p}{100}(n + 1) = \frac{80}{100}(12 + 1) = 10.4$
► Step 3. 80th percentile = 6050 + 0.4×(6130 – 6050) = 6082
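The three steps above can be sketched as a small Python function. The function name `percentile` is illustrative, and clamping to the smallest or largest value when the location falls outside the data is an assumption of this sketch, not something stated in the lecture.

```python
def percentile(data, p):
    """p-th percentile using the location rule L_p = p/100 * (n + 1)."""
    ordered = sorted(data)                 # Step 1: ascending order
    n = len(ordered)
    location = p / 100 * (n + 1)           # Step 2: 1-based location
    whole = int(location)                  # integer part of the location
    fraction = location - whole            # fractional part
    if whole < 1:                          # assumption: clamp below the data
        return ordered[0]
    if whole >= n:                         # assumption: clamp above the data
        return ordered[-1]
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + fraction * (upper - lower)   # Step 3: interpolate


salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

p80 = percentile(salaries, 80)
print(round(p80, 2))  # 6082.0
```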
Quartiles
► Quartiles are scale points that divide the sorted data into four
groups of approximately equal size.
► The first quartile, Q1, is the value for which 25% of the observations are
smaller and 75% are larger.
► Q2 is the same as the median (50% are smaller, 50% are larger)
► Only 25% of the observations are greater than the third quartile.
Questions?
► What is the median of the data values below Q2?
► What is the median of the data values above Q3?
Interquartile Range
► The interquartile range describes the middle 50% of the data.
► It ignores the highest and lowest 25% of the observations, so it is
not affected by outliers.
► Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Box-and-Whisker plot
► A box plot (also called a box-and-whisker plot) is a graphical
display showing the relative position of the five-number
summary:
Min, Q1, Q2, Q3, Max
► It also shows outliers, if any.
Outliers
►Formulas for the upper and lower limits of outliers
►Upper Limit = Q3 + 1.5 IQR
►Lower Limit = Q1 - 1.5 IQR
►Values beyond these limits are considered outliers
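The outlier limits can be illustrated in Python by reusing the starting salary data from the percentile example. This is a sketch under one assumption: quartiles are taken with `statistics.quantiles(..., method="exclusive")`, which places them at the p/100 × (n + 1) location; other software may use a different quartile rule and give slightly different limits.

```python
import statistics

salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

# Quartiles located at p/100 * (n + 1), with linear interpolation
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="exclusive")

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_limit or x > upper_limit]

print(q1, q3, iqr)               # 5857.5 6025.0 167.5
print(lower_limit, upper_limit)  # 5606.25 6276.25
print(outliers)                  # [6325]
```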
Examples
Association Between Two Variables
(Covariance, Correlation)
Covariance
► The covariance measures the direction of the linear relationship between two
variables.
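► The population covariance:
$$\mathrm{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
► The sample covariance:
$$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
► Only concerned with the direction of the relationship (positive, negative, no
relationship)
► No causal effect is implied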
Covariance
► Covariance between two variables:
► Cov(x,y) > 0: x and y tend to move in the same direction
► Cov(x,y) < 0: x and y tend to move in opposite directions
► Cov(x,y) = 0: x and y have no linear relationship
Correlation Coefficient
► The sample correlation coefficient, $r_{xy}$, measures both the strength and
direction of the linear relationship between two variables.
► Formula for population correlation coefficient:
$$\rho = \frac{\mathrm{Cov}(x, y)}{\sigma_X \sigma_Y}$$
► Formula for sample correlation coefficient:
$$r = \frac{\mathrm{Cov}(x, y)}{s_X s_Y}$$
where $s_X$ and $s_Y$ are the standard deviations of the x and y variables.
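The sample covariance and correlation formulas can be sketched in plain Python. The paired data below are hypothetical values chosen only for illustration, and the function names are assumptions of this sketch.

```python
import math


def sample_covariance(xs, ys):
    """Sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample SDs."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # Cov(x, x) is the variance of x
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(cov_xy)           # 1.5
print(round(r_xy, 4))   # 0.7746
```

The covariance only shows that the relationship is positive; the correlation rescales it to a unit-free value between -1 and 1.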
Correlation Coefficients
► Unit free
► Ranges between –1 and 1
► The closer to –1, the stronger the negative linear relationship
► The closer to 1, the stronger the positive linear relationship
► The closer to 0, the weaker any linear relationship
Excel
► Covariance for the sample:
=COVARIANCE.S(X DATA VALUES, Y DATA VALUES)
► Correlation for the sample:
=CORREL(X DATA VALUES, Y DATA VALUES)
► Excel Tools: Data/Data Analysis/Covariance
Data/Data Analysis/Correlation
Exercise
► Review Session 3 and complete Online Quiz 3.
► Read Chapter 4: Probability.
